On Fri, Jan 24, 2020 at 12:07 AM Qu Wenruo <wqu@xxxxxxxx> wrote:
>
> [BUG]
> For dev-replace test cases with fsstress, like btrfs/06[45] btrfs/071,
> looped runs can lead to random failure, where scrub finds csum error.
>
> The possibility is not high, around 1/20 to 1/100, but it's causing data
> corruption.
>
> The bug is observable after commit b12de52896c0 ("btrfs: scrub: Don't
> check free space before marking a block group RO")
>
> [CAUSE]
> Dev-replace has two source of writes:
> - Write duplication
> All writes to source device will also be duplicated to target device.
>
> Content: Not yet persisted data/meta
>
> - Scrub copy
> Dev-replace reused scrub code to iterate through existing extents, and
> copy the verified data to target device.
>
> Content: Previously persisted data and metadata
>
> The difference in contents makes the following race possible:
> Regular Writer | Dev-replace
> -----------------------------------------------------------------
> ^ |
> | Preallocate one data extent |
> | at bytenr X, len 1M |
> v |
> ^ Commit transaction |
> | Now extent [X, X+1M) is in |
> v commit root |
> ================== Dev replace starts =========================
> | ^
> | | Scrub extent [X, X+1M)
> | | Read [X, X+1M)
> | | (The content are mostly garbage
> | | since it's preallocated)
> ^ | v
> | Write back happens for |
> | extent [X, X+512K) |
> | New data writes to both |
> | source and target dev. |
> v |
> | ^
> | | Scrub writes back extent [X, X+1M)
> | | to target device.
> | | This will over write the new data in
> | | [X, X+512K)
> | v
>
> This race can only happen for nocow writes. Thus metadata and data cow
> writes are safe, as COW will never overwrite extents of previous trans
> (in commit root).
>
> This behavior can be confirmed by disabling all fallocate related calls
> in fsstress (*), then all related tests can pass a 2000 run loop.
>
> *: FSSTRESS_AVOID="-f fallocate=0 -f allocsp=0 -f zero=0 -f insert=0 \
> -f collapse=0 -f punch=0 -f resvsp=0"
> I didn't expect resvsp ioctl will fallback to fallocate in VFS...
>
> [FIX]
> Make dev-replace to require mandatory block group RO, and wait for current
> nocow writes before calling scrub_chunk().
>
> This patch will mostly revert commit 76a8efa171bf ("btrfs: Continue replace
> when set_block_ro failed") for dev-replace path.
>
> The side effect is, dev-replace can be more strict on avaialble space, but
> definitely worthy to avoid data corruption.
>
> Reported-by: Filipe Manana <fdmanana@xxxxxxxx>
> Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
> Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
> Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
Reviewed-by: Filipe Manana <fdmanana@xxxxxxxx>
Now it looks good, thanks!
> ---
> Changelog:
> RFC->v1:
> - Remove the RFC tag
> Since the cause is pinned and verified, no need for RFC.
>
> - Only wait for nocow writes for dev-replace
> CoW writes are safe as they will never overwrite extents in commit
> root.
>
> - Put the wait call into proper lock context
> Previous wait happens after scrub_pause_off(), which can cause
> deadlock where we may need to commit transaction in one of the
> wait calls. But since we are in scrub_pause_off() environment,
> transaction commit will wait us to continue, causing a wait-on-self
> deadlock.
>
> v2:
> - Add btrfs_wait_ordered_roots() call before scrub_chunk().
> - Commit message change to avoid confusion.
> ---
> fs/btrfs/scrub.c | 33 ++++++++++++++++++++++++++++-----
> 1 file changed, 28 insertions(+), 5 deletions(-)
>
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 21de630b0730..fd266a2d15ec 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3577,17 +3577,27 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
> * This can easily boost the amount of SYSTEM chunks if cleaner
> * thread can't be triggered fast enough, and use up all space
> * of btrfs_super_block::sys_chunk_array
> + *
> + * While for dev replace, we need to try our best to mark block
> + * group RO, to prevent race between:
> + * - Write duplication
> + * Contains latest data
> + * - Scrub copy
> + * Contains data from commit tree
> + *
> + * If target block group is not marked RO, nocow writes can
> + * be overwritten by scrub copy, causing data corruption.
> + * So for dev-replace, it's not allowed to continue if a block
> + * group is not RO.
> */
> - ret = btrfs_inc_block_group_ro(cache, false);
> - scrub_pause_off(fs_info);
> -
> + ret = btrfs_inc_block_group_ro(cache, sctx->is_dev_replace);
> if (ret == 0) {
> ro_set = 1;
> - } else if (ret == -ENOSPC) {
> + } else if (ret == -ENOSPC && !sctx->is_dev_replace) {
> /*
> * btrfs_inc_block_group_ro return -ENOSPC when it
> * failed in creating new chunk for metadata.
> - * It is not a problem for scrub/replace, because
> + * It is not a problem for scrub, because
> * metadata are always cowed, and our scrub paused
> * commit_transactions.
> */
> @@ -3596,9 +3606,22 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
> btrfs_warn(fs_info,
> "failed setting block group ro: %d", ret);
> btrfs_put_block_group(cache);
> + scrub_pause_off(fs_info);
> break;
> }
>
> + /*
> + * Now the target block is marked RO, wait for nocow writes to
> + * finish before dev-replace.
> + * COW is fine, as COW never overwrites extents in commit tree.
> + */
> + if (sctx->is_dev_replace) {
> + btrfs_wait_nocow_writers(cache);
> + btrfs_wait_ordered_roots(fs_info, U64_MAX, cache->start,
> + cache->length);
> + }
> +
> + scrub_pause_off(fs_info);
> down_write(&dev_replace->rwsem);
> dev_replace->cursor_right = found_key.offset + length;
> dev_replace->cursor_left = found_key.offset;
> --
> 2.25.0
>
--
Filipe David Manana,
“Whether you think you can, or you think you can't — you're right.”