Re: [PATCH RFC] btrfs: scrub: Mandatory RO block group for device replace

On Wed, Jan 22, 2020 at 8:37 AM Qu Wenruo <wqu@xxxxxxxx> wrote:
>
> [BUG]
> btrfs/06[45] and btrfs/071 can fail with csum errors.
> The reproducibility is not high, around 1/20~1/100, so the tests need
> to be run in loops.
>
> The profile doesn't make much difference; SINGLE/SINGLE can also
> reproduce the problem.
>
> The bug is observable after commit b12de52896c0 ("btrfs: scrub: Don't
> check free space before marking a block group RO")
>
> [CAUSE]
> Device replace reuses the scrub code to iterate over existing extents.
>
> It adds scrub_write_block_to_dev_replace() to scrub_block_complete(), so
> that scrub reads can write the verified data to the target device.
>
> Device replace also utilizes "write duplication" to write new data to
> both the source and the target device.
>
> However, those two kinds of writes can conflict and may lead to data
> corruption:
> - Scrub writes old data from the commit root
>   Both the extent location and the csum are fetched from the commit
>   root, which is not always the up-to-date data.
>
> - Write duplication always duplicates the latest data
>
> This means there can be a race: "write duplication" writes the latest
> data to disk, then scrub writes back the old data, causing data
> corruption.

Worth mentioning that this is for nocow writes only, then.
Given that the test cases that fail use fsstress and don't use nocow
files or -o nodatacow, the only possible case is writes into prealloc
extents.
Write duplication writes the new data and then extent iteration writes
zeroes (or whatever is on disk) after that.
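
To make the problematic ordering concrete, here is a toy userspace
sketch (not btrfs code, purely illustrative of the interleaving):

/* Toy illustration of the bad ordering described above (not btrfs code). */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char target_dev[16];
	char commit_root_data[16];

	/*
	 * Scrub captures the extent contents as described by the commit
	 * root; for a prealloc extent that is whatever is on disk, e.g.
	 * zeroes.
	 */
	strcpy(commit_root_data, "OLD/ZEROES");

	/* 1) Write duplication lands first: the latest user data reaches
	 *    the target device. */
	strcpy(target_dev, "NEW DATA");

	/* 2) The dev-replace scrub write lands second and writes back the
	 *    stale commit-root data, clobbering the new data. */
	strcpy(target_dev, commit_root_data);

	/* The target now differs from the source -> csum errors later on. */
	printf("target device: %s\n", target_dev);
	return 0;
}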

>
> In theory, this should only affect data, not metadata.
> Metadata writeback only happens when committing a transaction, thus it's
> always after scrub writes.

No, not only when committing a transaction.
It can also happen under memory pressure; tree extents can be written
out before that. In fact, if you remember the 5.2 corruption and
deadlock, the deadlock case happened precisely when writeback of the
btree inode was triggered before a transaction commit.

>
> [FIX]
> Make dev-replace require the target block group to be marked RO.
>
> And to be extra safe, for dev-replace, wait for all existing writes to
> finish before scrubbing the chunk.
>
> This patch mostly reverts commit 76a8efa171bf ("btrfs: Continue replace
> when set_block_ro failed").
> Hitting ENOSPC during dev-replace is still much better than data corruption.
>
> Reported-by: Filipe Manana <fdmanana@xxxxxxxx>
> Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
> Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
> Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
> ---
> Not concretely confirmed, mostly a guess, thus the RFC tag.

Well, it's better to confirm...
IIRC, dev-replace does not skip copying prealloc extents, it copies
whatever is on disk.
If that's the case, then this is correct. However, if it's smart and
skips copying prealloc extents (which is pointless to copy), then the
problem must have another technical explanation.

>
> My first guess was a race at the dev-replace starting point, but the
> related code is in fact very safe.
> ---
>  fs/btrfs/scrub.c | 35 ++++++++++++++++++++++++++++++++---
>  1 file changed, 32 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 21de630b0730..69e76a4d1258 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3472,6 +3472,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>         struct btrfs_path *path;
>         struct btrfs_fs_info *fs_info = sctx->fs_info;
>         struct btrfs_root *root = fs_info->dev_root;
> +       bool is_dev_replace = sctx->is_dev_replace;

Not needed, just use sctx->is_dev_replace like everywhere else.
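
I.e. something like this (untested, just to illustrate the suggestion):

-               ret = btrfs_inc_block_group_ro(cache, false);
+               ret = btrfs_inc_block_group_ro(cache, sctx->is_dev_replace);

and likewise check !sctx->is_dev_replace directly in the -ENOSPC branch
below.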

Thanks.

>         u64 length;
>         u64 chunk_offset;
>         int ret = 0;
> @@ -3577,17 +3578,35 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>                  * This can easily boost the amount of SYSTEM chunks if cleaner
>                  * thread can't be triggered fast enough, and use up all space
>                  * of btrfs_super_block::sys_chunk_array
> +                *
> +                *
> +                * On the other hand, try our best to mark block group RO for
> +                * dev-replace case.
> +                *
> +                * Dev-replace has two types of write:
> +                * - Write duplication
> +                *   New write will be written to both target and source device
> +                *   The content is always the *newest* data.
> +                * - Scrub write for dev-replace
> +                *   Scrub will write the verified data for dev-replace.
> +                *   The data and its csum are all from *commit* root, which
> +                *   is not the newest version.
> +                *
> +                * If scrub write happens after write duplication, we would
> +                * cause data corruption.
> +                * So we need to try our best to mark block group RO, and exit
> +                * out if we don't have enough space.
>                  */
> -               ret = btrfs_inc_block_group_ro(cache, false);
> +               ret = btrfs_inc_block_group_ro(cache, is_dev_replace);
>                 scrub_pause_off(fs_info);
>
>                 if (ret == 0) {
>                         ro_set = 1;
> -               } else if (ret == -ENOSPC) {
> +               } else if (ret == -ENOSPC && !is_dev_replace) {
>                         /*
>                          * btrfs_inc_block_group_ro return -ENOSPC when it
>                          * failed in creating new chunk for metadata.
> -                        * It is not a problem for scrub/replace, because
> +                        * It is not a problem for scrub, because
>                          * metadata are always cowed, and our scrub paused
>                          * commit_transactions.
>                          */
> @@ -3605,6 +3624,16 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>                 dev_replace->item_needs_writeback = 1;
>                 up_write(&dev_replace->rwsem);
>
> +               /*
> +                * Also wait for any existing writes to prevent the race between
> +                * write duplication and scrub writes.
> +                */
> +               if (is_dev_replace) {
> +                       btrfs_wait_block_group_reservations(cache);
> +                       btrfs_wait_nocow_writers(cache);
> +                       btrfs_wait_ordered_roots(fs_info, U64_MAX,
> +                                       cache->start, cache->length);
> +               }
>                 ret = scrub_chunk(sctx, scrub_dev, chunk_offset, length,
>                                   found_key.offset, cache);
>
> --
> 2.25.0
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”



