Re: [PATCH] btrfs: scrub: Require mandatory block group RO for dev-replace

On Thu, Jan 23, 2020 at 1:39 PM Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote:
>
>
>
> > On 2020/1/23 8:06 PM, Filipe Manana wrote:
> > On Thu, Jan 23, 2020 at 7:38 AM Qu Wenruo <wqu@xxxxxxxx> wrote:
> >>
> >> [BUG]
> >> For dev-replace test cases with fsstress, like btrfs/06[45] and
> >> btrfs/071, looped runs can fail randomly, with scrub finding csum errors.
> >>
> >> The probability is not high, around 1/20 to 1/100, but it causes data
> >> corruption.
> >>
> >> The bug is observable after commit b12de52896c0 ("btrfs: scrub: Don't
> >> check free space before marking a block group RO").
> >>
> >> [CAUSE]
> >> Dev-replace has two sources of writes:
> >> - Write duplication
> >>   All writes to the source device are also duplicated to the
> >>   target device.
> >>
> >>   Content:      Latest data/meta
> >
> > I find the term "latest" a bit confusing; perhaps "not yet persisted
> > data and metadata" is clearer.
> >
> >>
> >> - Scrub copy
> >>   Dev-replace reuses the scrub code to iterate through existing
> >>   extents and copy the verified data to the target device.
> >>
> >>   Content:      Data/meta in commit root
> >
> > And so here "previously persisted data and metadata".
> >
> >>
> >> The difference in contents makes the following race possible:
> >>         Regular Writer          |       Dev-replace
> >> -----------------------------------------------------------------
> >>   ^                             |
> >>   | Preallocate one data extent |
> >>   | at bytenr X, len 1M         |
> >>   v                             |
> >>   ^ Commit transaction          |
> >>   | Now extent [X, X+1M) is in  |
> >>   v commit root                 |
> >>  ================== Dev replace starts =========================
> >>                                 | ^
> >>                                 | | Scrub extent [X, X+1M)
> >>                                 | | Read [X, X+1M)
> >>                                 | | (The content is mostly garbage
> >>                                 | |  since it's preallocated)
> >>   ^                             | v
> >>   | Write back happens for      |
> >>   | extent [X, X+512K)          |
> >>   | New data writes to both     |
> >>   | source and target dev.      |
> >>   v                             |
> >>                                 | ^
> >>                                 | | Scrub writes back extent [X, X+1M)
> >>                                 | | to target device.
> >>                                 | | This will overwrite the new data in
> >>                                 | | [X, X+512K)
> >>                                 | v
> >>
> >> This race can only happen for nocow writes. Metadata and data COW
> >> writes are safe, as COW never overwrites extents of a previous
> >> transaction (in the commit root).
> >>
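
A minimal userspace sketch of the race above, for illustration only:
plain C with pthreads, not btrfs code. "src"/"tgt" stand in for the
source and target devices, and the extent sizes are scaled down; all
names here are made up:

/*
 * One thread plays the regular nocow writer updating part of a
 * preallocated extent; the other plays scrub, copying a stale
 * snapshot of the whole extent over the target.
 * Build: cc -pthread -o race race.c
 */
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define EXTENT_LEN 8                    /* stands in for the 1M extent  */
#define NOCOW_LEN  4                    /* stands in for the 512K write */

static char src[EXTENT_LEN];            /* source device                */
static char tgt[EXTENT_LEN];            /* dev-replace target device    */

/* Regular writer: a nocow write, duplicated to both devices. */
static void *nocow_writer(void *arg)
{
        memset(src, 'N', NOCOW_LEN);
        memset(tgt, 'N', NOCOW_LEN);
        return NULL;
}

/* Scrub copy: writes back a stale snapshot of the whole extent. */
static void *scrub_copy(void *arg)
{
        char snap[EXTENT_LEN];

        memcpy(snap, src, EXTENT_LEN);  /* read the "commit root" view  */
        usleep(1000);                   /* widen the race window        */
        memcpy(tgt, snap, EXTENT_LEN);  /* may clobber the nocow write  */
        return NULL;
}

int main(void)
{
        pthread_t w, s;

        memset(src, 'P', EXTENT_LEN);   /* preallocated garbage         */
        memset(tgt, 'P', EXTENT_LEN);

        pthread_create(&s, NULL, scrub_copy, NULL);
        pthread_create(&w, NULL, nocow_writer, NULL);
        pthread_join(s, NULL);
        pthread_join(w, NULL);

        /* If scrub's stale snapshot won, tgt no longer matches src. */
        printf("src=%.*s tgt=%.*s -> %s\n", EXTENT_LEN, src, EXTENT_LEN,
               tgt, memcmp(src, tgt, EXTENT_LEN) ? "CORRUPTED" : "ok");
        return 0;
}
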
> >> This behavior can be confirmed by disabling all fallocate-related
> >> calls in fsstress (*); with those disabled, all related tests pass a
> >> 2000-run loop.
> >>
> >> *: FSSTRESS_AVOID="-f fallocate=0 -f allocsp=0 -f zero=0 -f insert=0 \
> >>                    -f collapse=0 -f punch=0 -f resvsp=0"
> >>    I didn't expect the resvsp ioctl to fall back to fallocate in the VFS...
> >>
> >> [FIX]
> >> Make dev-replace require the block group to be marked RO, and wait for
> >> in-flight nocow writes before calling scrub_chunk().
> >>
> >> This patch mostly reverts commit 76a8efa171bf ("btrfs: Continue replace
> >> when set_block_ro failed") for the dev-replace path.
> >>
> >> ENOSPC for dev-replace is still much better than data corruption.
> >
> > Technically, if we flag the block group RO without being able to
> > persist that due to ENOSPC, it's still ok.
> > We just want to prevent nocow writes from racing with the scrub copy
> > procedure. But that's something for another time, and this is fine
> > by me.
> >
> >>
> >> Reported-by: Filipe Manana <fdmanana@xxxxxxxx>
> >> Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
> >> Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
> >> Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
> >> ---
> >> Changelog:
> >> RFC->v1:
> >> - Remove the RFC tag
> >>   Since the cause is pinned down and verified, there's no need for RFC.
> >>
> >> - Only wait for nocow writes for dev-replace
> >>   COW writes are safe, as they never overwrite extents in the commit
> >>   root.
> >>
> >> - Put the wait call into the proper lock context
> >>   The previous wait happened after scrub_pause_off(), which could
> >>   deadlock: we may need to commit a transaction in one of the wait
> >>   calls, but since we are in a scrub_pause_off() environment, the
> >>   transaction commit will wait for us to continue, causing a
> >>   wait-on-self deadlock.
> >> ---
> >>  fs/btrfs/scrub.c | 30 +++++++++++++++++++++++++-----
> >>  1 file changed, 25 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> >> index 21de630b0730..5aa486cad298 100644
> >> --- a/fs/btrfs/scrub.c
> >> +++ b/fs/btrfs/scrub.c
> >> @@ -3577,17 +3577,27 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
> >>                  * This can easily boost the amount of SYSTEM chunks if cleaner
> >>                  * thread can't be triggered fast enough, and use up all space
> >>                  * of btrfs_super_block::sys_chunk_array
> >> +                *
> >> +                * For dev-replace, however, we need to try our best to mark
> >> +                * the block group RO, to prevent a race between:
> >> +                * - Write duplication
> >> +                *   Contains latest data
> >> +                * - Scrub copy
> >> +                *   Contains data from commit tree
> >> +                *
> >> +                * If the target block group is not marked RO, nocow writes can
> >> +                * be overwritten by the scrub copy, causing data corruption.
> >> +                * So for dev-replace, it is not allowed to continue if a block
> >> +                * group is not RO.
> >>                  */
> >> -               ret = btrfs_inc_block_group_ro(cache, false);
> >> -               scrub_pause_off(fs_info);
> >> -
> >> +               ret = btrfs_inc_block_group_ro(cache, sctx->is_dev_replace);
> >>                 if (ret == 0) {
> >>                         ro_set = 1;
> >> -               } else if (ret == -ENOSPC) {
> >> +               } else if (ret == -ENOSPC && !sctx->is_dev_replace) {
> >>                         /*
> >>                          * btrfs_inc_block_group_ro return -ENOSPC when it
> >>                          * failed in creating new chunk for metadata.
> >> -                        * It is not a problem for scrub/replace, because
> >> +                        * It is not a problem for scrub, because
> >>                          * metadata are always cowed, and our scrub paused
> >>                          * commit_transactions.
> >>                          */
> >> @@ -3596,9 +3606,19 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
> >>                         btrfs_warn(fs_info,
> >>                                    "failed setting block group ro: %d", ret);
> >>                         btrfs_put_block_group(cache);
> >> +                       scrub_pause_off(fs_info);
> >>                         break;
> >>                 }
> >>
> >> +               /*
> >> +                * Now the target block group is marked RO, wait for nocow
> >> +                * writes to finish before dev-replace copies it.
> >> +                * COW is fine, as COW never overwrites extents in the commit tree.
> >> +                */
> >> +               if (sctx->is_dev_replace)
> >> +                       btrfs_wait_nocow_writers(cache);
> >
> > So this only waits for nocow ordered extents to be created - it
> > doesn't wait for them to complete.
>
> Wait a minute.
>
> This btrfs_wait_nocow_writers() does not just trigger ordered extents
> for nocow writes.
> It waits for the nocow_writers count to drop to 0 for that block group.
>
> Since we have already marked the block group RO, no new nocow writers
> can appear, so the counter can only decrease, never increase.
>
> There are several cases involved:
> - NoCOW Write back happens before bg RO
>   It will increase the cache->nocow_writers counter, and decrease the
>   counter after btrfs_finish_ordered_io().

Nope. nocow_writers is decremented after creating the ordered extent,
when starting writeback (at run_delalloc_nocow()) - not when the
ordered extent completes (at btrfs_finish_ordered_io()).
The same applies to direct IO.
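
That distinction can be modeled in userspace. Below is a rough sketch
(illustrative only; the counter name mirrors the kernel's, but the
mechanics are heavily simplified) showing that the count drops when
writeback starts, so a waiter can wake while the IO is still in flight:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int nocow_writers;       /* models cache->nocow_writers       */
static int started;             /* writer has begun                  */
static int io_done;             /* models ordered extent completion  */

static void *nocow_write(void *arg)
{
        pthread_mutex_lock(&lock);
        nocow_writers++;                /* btrfs_inc_nocow_writers()  */
        started = 1;
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);

        /* "create the ordered extent": writeback is now *started*   */

        pthread_mutex_lock(&lock);
        nocow_writers--;                /* dropped HERE, at start     */
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);

        sleep(1);                       /* the actual IO is in flight */

        pthread_mutex_lock(&lock);
        io_done = 1;                    /* btrfs_finish_ordered_io()  */
        pthread_mutex_unlock(&lock);
        return NULL;
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, nocow_write, NULL);

        /* models btrfs_wait_nocow_writers(): wait for count == 0    */
        pthread_mutex_lock(&lock);
        while (!started || nocow_writers)
                pthread_cond_wait(&cond, &lock);
        printf("nocow_writers drained, io_done=%d\n", io_done);
        pthread_mutex_unlock(&lock);

        pthread_join(t, NULL);
        return 0;
}

When run, this prints io_done=0: the waiter is released before the data
reaches disk, which is why waiting for ordered extent completion (e.g.
via btrfs_wait_ordered_roots()) is a separate step.
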

Thanks


>   btrfs_wait_nocow_writers() will wait for such writes.
>
> - Writeback happens after bg RO
>   Then the write will fall back to COW.
>
> - Writeback happens after bg RO is cleared (reverted back to RW)
>   No need to bother at all.
>
> Thus this should be enough, no need for btrfs_wait_ordered_roots().
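
To see why the counter can only drain once the block group is RO, as
argued above, here is a simplified userspace model. inc_nocow_writers()
and dec_nocow_writers() are made-up stand-ins for the kernel helpers,
and the RO check is reduced to a flag:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool ro;                 /* block group marked read-only      */
static int  nocow_writers;

/* models btrfs_inc_nocow_writers(): refuse once the bg is RO */
static bool inc_nocow_writers(void)
{
        bool ok;

        pthread_mutex_lock(&lock);
        ok = !ro;
        if (ok)
                nocow_writers++;
        pthread_mutex_unlock(&lock);
        return ok;              /* false -> caller falls back to COW */
}

static void dec_nocow_writers(void)
{
        pthread_mutex_lock(&lock);
        if (--nocow_writers == 0)
                pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);
}

int main(void)
{
        inc_nocow_writers();            /* writer got in before RO    */

        pthread_mutex_lock(&lock);
        ro = true;                      /* btrfs_inc_block_group_ro() */
        pthread_mutex_unlock(&lock);

        printf("late nocow write allowed? %d\n", inc_nocow_writers());

        dec_nocow_writers();            /* pre-RO writer finishes     */

        /* models btrfs_wait_nocow_writers(): guaranteed to finish   */
        pthread_mutex_lock(&lock);
        while (nocow_writers)
                pthread_cond_wait(&cond, &lock);
        pthread_mutex_unlock(&lock);
        printf("all pre-RO nocow writers drained\n");
        return 0;
}

Once ro is set, every later increment attempt fails and the write falls
back to COW, so the wait loop is guaranteed to terminate.
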
>
> Thanks,
> Qu
>
>
> > After that you still need to call:
> >
> > btrfs_wait_ordered_roots(fs_info, U64_MAX, cache->start, cache->length);
> >
> > Other than that, looks good to me.
> >
> > Thanks.
> >
> >> +
> >> +               scrub_pause_off(fs_info);
> >>                 down_write(&dev_replace->rwsem);
> >>                 dev_replace->cursor_right = found_key.offset + length;
> >>                 dev_replace->cursor_left = found_key.offset;
> >> --
> >> 2.25.0
> >>
> >
> >
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”



