On 2020/1/23 下午8:06, Filipe Manana wrote:
> On Thu, Jan 23, 2020 at 7:38 AM Qu Wenruo <wqu@xxxxxxxx> wrote:
>>
>> [BUG]
>> For dev-replace test cases with fsstress, like btrfs/06[45] btrfs/071,
>> looped runs can lead to random failure, where scrub finds csum error.
>>
>> The possibility is not high, around 1/20 to 1/100, but it's causing data
>> corruption.
>>
>> The bug is observable after commit b12de52896c0 ("btrfs: scrub: Don't
>> check free space before marking a block group RO")
>>
>> [CAUSE]
>> Dev-replace has two source of writes:
>> - Write duplication
>> All writes to source device will also be duplicated to target device.
>>
>> Content: Latest data/meta
>
> I find the term "latest" a bit confusing, perhaps "not yet persisted
> data and metadata" is more clear.
>
>>
>> - Scrub copy
>> Dev-replace reused scrub code to iterate through existing extents, and
>> copy the verified data to target device.
>>
>> Content: Data/meta in commit root
>
> And so here "previously persisted data and metadata".
>
>>
>> The difference in contents makes the following race possible:
>> Regular Writer | Dev-replace
>> -----------------------------------------------------------------
>> ^ |
>> | Preallocate one data extent |
>> | at bytenr X, len 1M |
>> v |
>> ^ Commit transaction |
>> | Now extent [X, X+1M) is in |
>> v commit root |
>> ================== Dev replace starts =========================
>> | ^
>> | | Scrub extent [X, X+1M)
>> | | Read [X, X+1M)
>> | | (The content are mostly garbage
>> | | since it's preallocated)
>> ^ | v
>> | Write back happens for |
>> | extent [X, X+512K) |
>> | New data writes to both |
>> | source and target dev. |
>> v |
>> | ^
>> | | Scrub writes back extent [X, X+1M)
>> | | to target device.
>> | | This will over write the new data in
>> | | [X, X+512K)
>> | v
>>
>> This race can only happen for nocow writes. Thus metadata and data cow
>> writes are safe, as COW will never overwrite extents of previous trans
>> (in commit root).
>>
>> This behavior can be confirmed by disabling all fallocate related calls
>> in fsstress (*), then all related tests can pass a 2000 run loop.
>>
>> *: FSSTRESS_AVOID="-f fallocate=0 -f allocsp=0 -f zero=0 -f insert=0 \
>> -f collapse=0 -f punch=0 -f resvsp=0"
>> I didn't expect resvsp ioctl will fallback to fallocate in VFS...
>>
>> [FIX]
>> Make dev-replace to require mandatory block group RO, and wait for current
>> nocow writes before calling scrub_chunk().
>>
>> This patch will mostly revert commit 76a8efa171bf ("btrfs: Continue replace
>> when set_block_ro failed") for dev-replace path.
>>
>> ENOSPC for dev-replace is still much better than data corruption.
>
> Technically if we flag the block group RO without being able to
> persist that due to ENOSPC, it's still ok.
Right, I will change the words, since it only slightly increase the
possibility of ENOSPC, not ensured to cause ENOSPC and abort replace.
> We just want to prevent nocow writes racing with scrub copying
> procedure. But that's something for some other time, and this is fine
> to me.
>
>>
>> Reported-by: Filipe Manana <fdmanana@xxxxxxxx>
>> Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
>> Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
>> Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
>> ---
>> Changelog:
>> RFC->v1:
>> - Remove the RFC tag
>> Since the cause is pinned and verified, no need for RFC.
>>
>> - Only wait for nocow writes for dev-replace
>> CoW writes are safe as they will never overwrite extents in commit
>> root.
>>
[...]
>> + /*
>> + * Now the target block is marked RO, wait for nocow writes to
>> + * finish before dev-replace.
>> + * COW is fine, as COW never overwrites extents in commit tree.
>> + */
>> + if (sctx->is_dev_replace)
>> + btrfs_wait_nocow_writers(cache);
>
> So this only waits for nocow ordered extents to be created - it
> doesn't wait for them to complete.
> After that you still need to call:
>
> btrfs_wait_ordered_roots(fs_info, U64_MAX, cache->start, cache->length);
Forgot that, although 1000 runs doesn't expose any problem you are
completely right.
I'll update the patch to address all mentioned comments.
Thanks,
Qu
>
> Other than that, looks good to me.
>
> Thanks.
>
>> +
>> + scrub_pause_off(fs_info);
>> down_write(&dev_replace->rwsem);
>> dev_replace->cursor_right = found_key.offset + length;
>> dev_replace->cursor_left = found_key.offset;
>> --
>> 2.25.0
>>
>
>