On Thu, Jan 23, 2020 at 1:39 PM Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote:
>
>
>
> On 2020/1/23 8:06 PM, Filipe Manana wrote:
> > On Thu, Jan 23, 2020 at 7:38 AM Qu Wenruo <wqu@xxxxxxxx> wrote:
> >>
> >> [BUG]
> >> For dev-replace test cases with fsstress, like btrfs/06[45] and btrfs/071,
> >> looped runs can lead to random failures where scrub finds csum errors.
> >>
> >> The probability is not high, around 1/20 to 1/100, but it's causing data
> >> corruption.
> >>
> >> The bug is observable after commit b12de52896c0 ("btrfs: scrub: Don't
> >> check free space before marking a block group RO")
> >>
> >> [CAUSE]
> >> Dev-replace has two sources of writes:
> >> - Write duplication
> >>   All writes to the source device are also duplicated to the target device.
> >>
> >> Content: Latest data/meta
> >
> > I find the term "latest" a bit confusing, perhaps "not yet persisted
> > data and metadata" is more clear.
> >
> >>
> >> - Scrub copy
> >>   Dev-replace reuses the scrub code to iterate through existing extents
> >>   and copy the verified data to the target device.
> >>
> >> Content: Data/meta in commit root
> >
> > And so here "previously persisted data and metadata".
> >
> >>
> >> The difference in contents makes the following race possible:
> >>           Regular Writer        |         Dev-replace
> >> -----------------------------------------------------------------
> >>   ^                             |
> >>   | Preallocate one data extent |
> >>   | at bytenr X, len 1M         |
> >>   v                             |
> >>   ^ Commit transaction          |
> >>   | Now extent [X, X+1M) is in  |
> >>   v commit root                 |
> >> ================== Dev replace starts =========================
> >>                                 |         ^
> >>                                 |         | Scrub extent [X, X+1M)
> >>                                 |         | Read [X, X+1M)
> >>                                 |         | (The content is mostly garbage
> >>                                 |         |  since it's preallocated)
> >>   ^                             |         v
> >>   | Write back happens for      |
> >>   | extent [X, X+512K)          |
> >>   | New data writes to both     |
> >>   | source and target dev.      |
> >>   v                             |
> >>                                 |         ^
> >>                                 |         | Scrub writes back extent [X, X+1M)
> >>                                 |         | to target device.
> >>                                 |         | This will overwrite the new data
> >>                                 |         | in [X, X+512K)
> >>                                 |         v
> >>
> >> This race can only happen for nocow writes. Thus metadata and data COW
> >> writes are safe, as COW never overwrites extents from a previous
> >> transaction (which are in the commit root).
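
(Side note: the race is easy to replay outside the kernel. The snippet
below is plain user-space C with made-up names, not btrfs code; it just
replays the interleaving from the diagram deterministically: the nocow
write lands in the first half of the extent, then the scrub copy writes
the stale commit-root content over the whole extent on the target.)

/*
 * Deterministic replay of the interleaving above.  Plain user-space C,
 * NOT btrfs code; all names are made up.
 *
 *   "source"   - the live extent on the source device
 *   "snapshot" - the commit-root view of the extent that scrub reads
 *   "target"   - the same extent on the replace target device
 */
#include <stdio.h>
#include <string.h>

#define EXTENT_SIZE 8           /* stands in for the 1M extent at bytenr X */

static char source[EXTENT_SIZE];
static char snapshot[EXTENT_SIZE];
static char target[EXTENT_SIZE];

/* nocow write: new data is duplicated to both source and target devices */
static void nocow_write_first_half(void)
{
        memset(source, 'N', EXTENT_SIZE / 2);
        memset(target, 'N', EXTENT_SIZE / 2);
}

/* scrub copy: writes the stale commit-root content over the whole extent */
static void scrub_copy_whole_extent(void)
{
        memcpy(target, snapshot, EXTENT_SIZE);
}

int main(void)
{
        /* preallocated extent: content is garbage ('P') everywhere */
        memset(source, 'P', EXTENT_SIZE);
        memset(snapshot, 'P', EXTENT_SIZE);
        memset(target, 'P', EXTENT_SIZE);

        /* scrub already read "snapshot"; now the nocow write lands */
        nocow_write_first_half();
        /* scrub writes its stale copy back, clobbering the new data */
        scrub_copy_whole_extent();

        printf("source: %.*s\n", EXTENT_SIZE, source); /* NNNNPPPP */
        printf("target: %.*s\n", EXTENT_SIZE, target); /* PPPPPPPP, corrupted */
        return 0;
}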
> >>
> >> This behavior can be confirmed by disabling all fallocate related calls
> >> in fsstress (*); with that, all related tests pass a 2000-run loop.
> >>
> >> *: FSSTRESS_AVOID="-f fallocate=0 -f allocsp=0 -f zero=0 -f insert=0 \
> >> -f collapse=0 -f punch=0 -f resvsp=0"
> >> I didn't expect the resvsp ioctl to fall back to fallocate in the VFS...
> >>
> >> [FIX]
> >> Make dev-replace require the block group to be marked RO, and wait for
> >> in-flight nocow writes before calling scrub_chunk().
> >>
> >> This patch mostly reverts commit 76a8efa171bf ("btrfs: Continue replace
> >> when set_block_ro failed") for the dev-replace path.
> >>
> >> ENOSPC for dev-replace is still much better than data corruption.
> >
> > Technically, if we flag the block group RO without being able to
> > persist that due to ENOSPC, it's still ok.
> > We just want to prevent nocow writes from racing with the scrub copying
> > procedure. But that's something for some other time, and this is fine
> > by me.
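
(To make that concrete: what actually stops the race is the in-memory RO
flag that would-be nocow writers check before going nocow; persisting the
RO state is not required for this particular exclusion. A toy user-space
sketch with made-up names, not btrfs code:)

#include <stdbool.h>
#include <stdio.h>

/* toy model only; the real structure is btrfs' block group cache */
struct block_group {
        bool ro;                /* checked purely in memory in this model */
        int nocow_writers;
};

/* a writer may only do a nocow write if the block group is not RO */
static bool try_begin_nocow_write(struct block_group *bg)
{
        if (bg->ro)
                return false;   /* caller falls back to COW */
        bg->nocow_writers++;
        return true;
}

int main(void)
{
        struct block_group bg = { .ro = true };

        if (!try_begin_nocow_write(&bg))
                printf("bg is RO in memory -> write falls back to COW, "
                       "no race with the scrub copy\n");
        return 0;
}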
> >
> >>
> >> Reported-by: Filipe Manana <fdmanana@xxxxxxxx>
> >> Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
> >> Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
> >> Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
> >> ---
> >> Changelog:
> >> RFC->v1:
> >> - Remove the RFC tag
> >> Since the cause is pinned and verified, no need for RFC.
> >>
> >> - Only wait for nocow writes for dev-replace
> >> CoW writes are safe as they will never overwrite extents in commit
> >> root.
> >>
> >> - Put the wait call into the proper lock context
> >> The previous wait happened after scrub_pause_off(), which could deadlock:
> >> one of the wait calls may need to commit a transaction, but once we are
> >> past scrub_pause_off(), the transaction commit will wait for us (the
> >> scrub thread) to continue, causing a wait-on-self deadlock.
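
(For context on the ordering: roughly, a transaction commit waits until
every running scrub counts as paused, so any wait that may end up
committing a transaction has to run inside the scrub_pause_on() /
scrub_pause_off() window. The sketch below is a user-space toy model with
made-up counters, not the kernel's actual scrub pause machinery.)

#include <stdio.h>

static int scrubs_running = 1;  /* our scrub thread */
static int scrubs_paused;

static void scrub_pause_on(void)  { scrubs_paused++; }
static void scrub_pause_off(void) { scrubs_paused--; }

/* stand-in for any wait that may end up committing a transaction */
static void commit_transaction(const char *when)
{
        if (scrubs_paused == scrubs_running)
                printf("%s: commit proceeds\n", when);
        else
                printf("%s: commit would wait for the scrub (us) to pause"
                       " -> wait-on-self deadlock\n", when);
}

int main(void)
{
        scrub_pause_on();
        commit_transaction("inside the pause window (v1 ordering)");
        scrub_pause_off();

        commit_transaction("after scrub_pause_off() (previous ordering)");
        return 0;
}
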
> >> ---
> >> fs/btrfs/scrub.c | 30 +++++++++++++++++++++++++-----
> >> 1 file changed, 25 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> >> index 21de630b0730..5aa486cad298 100644
> >> --- a/fs/btrfs/scrub.c
> >> +++ b/fs/btrfs/scrub.c
> >> @@ -3577,17 +3577,27 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
> >>                   * This can easily boost the amount of SYSTEM chunks if cleaner
> >>                   * thread can't be triggered fast enough, and use up all space
> >>                   * of btrfs_super_block::sys_chunk_array
> >> +                 *
> >> +                 * While for dev replace, we need to try our best to mark block
> >> +                 * group RO, to prevent race between:
> >> +                 * - Write duplication
> >> +                 *   Contains latest data
> >> +                 * - Scrub copy
> >> +                 *   Contains data from commit tree
> >> +                 *
> >> +                 * If target block group is not marked RO, nocow writes can
> >> +                 * be overwritten by scrub copy, causing data corruption.
> >> +                 * So for dev-replace, it's not allowed to continue if a block
> >> +                 * group is not RO.
> >>                   */
> >> -                ret = btrfs_inc_block_group_ro(cache, false);
> >> -                scrub_pause_off(fs_info);
> >> -
> >> +                ret = btrfs_inc_block_group_ro(cache, sctx->is_dev_replace);
> >>                  if (ret == 0) {
> >>                          ro_set = 1;
> >> -                } else if (ret == -ENOSPC) {
> >> +                } else if (ret == -ENOSPC && !sctx->is_dev_replace) {
> >>                          /*
> >>                           * btrfs_inc_block_group_ro return -ENOSPC when it
> >>                           * failed in creating new chunk for metadata.
> >> -                         * It is not a problem for scrub/replace, because
> >> +                         * It is not a problem for scrub, because
> >>                           * metadata are always cowed, and our scrub paused
> >>                           * commit_transactions.
> >>                           */
> >> @@ -3596,9 +3606,19 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
> >>                          btrfs_warn(fs_info,
> >>                                     "failed setting block group ro: %d", ret);
> >>                          btrfs_put_block_group(cache);
> >> +                        scrub_pause_off(fs_info);
> >>                          break;
> >>                  }
> >>
> >> +                /*
> >> +                 * Now the target block is marked RO, wait for nocow writes to
> >> +                 * finish before dev-replace.
> >> +                 * COW is fine, as COW never overwrites extents in commit tree.
> >> +                 */
> >> +                if (sctx->is_dev_replace)
> >> +                        btrfs_wait_nocow_writers(cache);
> >
> > So this only waits for nocow ordered extents to be created - it
> > doesn't wait for them to complete.
>
> Wait a minute.
>
> This btrfs_wait_nocow_writers() does not just trigger ordered extents
> for nocow writes.
> It waits for the nocow_writers count of that block group to drop to 0.
>
> Since we have already marked the block group RO, no new nocow writers
> can appear, thus that counter can only decrease, never increase.
>
> There are several cases involved:
> - NoCOW writeback happens before the bg is marked RO
>   It will increase the cache->nocow_writers counter, and decrease the
>   counter after finish_ordered_io().
Nope, nocow_writers is decremented after creating the ordered extent
when starting writeback (at run_delalloc_nocow()), not when completing
the ordered extent (at btrfs_finish_ordered_io()).
The same applies to direct IO.
Thanks
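
To illustrate with a toy model (plain pthreads, made-up names, not the
real btrfs code): the counter reaches zero once every nocow writer has
created its ordered extent, while the data is only safely on the target
once the ordered extent completes, hence the extra
btrfs_wait_ordered_roots() call suggested in my earlier comment quoted
further down.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int nocow_writers = 1;   /* one nocow write already started */
static int ordered_extent_done; /* set once the "IO" has completed */

static void *nocow_write(void *arg)
{
        (void)arg;

        /* "run_delalloc_nocow": ordered extent created, counter dropped */
        pthread_mutex_lock(&lock);
        nocow_writers--;
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);

        sleep(1);               /* the actual write-out is still in flight */

        /* "btrfs_finish_ordered_io": only now is the data really on disk */
        pthread_mutex_lock(&lock);
        ordered_extent_done = 1;
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);
        return NULL;
}

static void wait_nocow_writers(void)    /* ~ btrfs_wait_nocow_writers() */
{
        pthread_mutex_lock(&lock);
        while (nocow_writers > 0)
                pthread_cond_wait(&cond, &lock);
        pthread_mutex_unlock(&lock);
}

static void wait_ordered_extents(void)  /* ~ btrfs_wait_ordered_roots() */
{
        pthread_mutex_lock(&lock);
        while (!ordered_extent_done)
                pthread_cond_wait(&cond, &lock);
        pthread_mutex_unlock(&lock);
}

int main(void)
{
        pthread_t writer;

        pthread_create(&writer, NULL, nocow_write, NULL);

        wait_nocow_writers();
        pthread_mutex_lock(&lock);
        printf("nocow_writers == 0, ordered extent finished? %s\n",
               ordered_extent_done ? "yes" : "no"); /* almost surely "no" */
        pthread_mutex_unlock(&lock);

        wait_ordered_extents();
        printf("ordered extent finished, now it is safe to scrub-copy\n");

        pthread_join(writer, NULL);
        return 0;
}
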
>   btrfs_wait_nocow_writers() will wait for such writes.
>
> - Writeback happens after the bg is marked RO
>   Then the write will fall back to COW.
>
> - Writeback happens after the bg RO is cleared (reverted back to RW)
>   No need to bother at all.
>
> Thus this should be enough, no need for btrfs_wait_ordered_roots().
>
> Thanks,
> Qu
>
>
> > After that you still need to call:
> >
> > btrfs_wait_ordered_roots(fs_info, U64_MAX, cache->start, cache->length);
> >
> > Other than that, looks good to me.
> >
> > Thanks.
> >
> >> +
> >> +                scrub_pause_off(fs_info);
> >>                  down_write(&dev_replace->rwsem);
> >>                  dev_replace->cursor_right = found_key.offset + length;
> >>                  dev_replace->cursor_left = found_key.offset;
> >> --
> >> 2.25.0
> >>
> >
> >
>
--
Filipe David Manana,
“Whether you think you can, or you think you can't — you're right.”