On 2020/1/22 6:05 PM, Filipe Manana wrote:
> On Wed, Jan 22, 2020 at 8:37 AM Qu Wenruo <wqu@xxxxxxxx> wrote:
>>
>> [BUG]
>> btrfs/06[45] and btrfs/071 can fail with csum errors.
>> The reproducibility is not high, around 1/20~1/100, so the tests need to
>> be run in a loop.
>>
>> The profile doesn't make much difference; SINGLE/SINGLE can also
>> reproduce the problem.
>>
>> The bug is observable after commit b12de52896c0 ("btrfs: scrub: Don't
>> check free space before marking a block group RO")
>>
>> [CAUSE]
>> Device replace reuses scrub code to iterate existing extents.
>>
>> It adds scrub_write_block_to_dev_replace() to scrub_block_complete(), so
>> that the scrub read path can write the verified data to the target device.
>>
>> Device replace also utilizes "write duplication" to write new data to
>> both the source and the target device.
>>
>> However those two writes can conflict and may lead to data corruption:
>> - Scrub writes old data from the commit root
>>   Both the extent location and the csum are fetched from the commit
>>   root, which is not always the up-to-date data.
>>
>> - Write duplication always duplicates the latest data
>>
>> This means there is a race window: "write duplication" writes the
>> latest data to disk, then scrub writes back the old data, causing data
>> corruption.
>
> Worth mentioning this is for nocow writes only then.
> Given that the test cases that fail use fsstress and don't use nocow
> files or -o nodatacow, the only possible case is writes into prealloc
> extents.
> Write duplication writes the new data and then extent iteration writes
> zeroes (or whatever is on disk) after that.
Thank you very much for mentioning prealloc extents, that's exactly the
missing piece!
My original assumption in fact has a hole: extents in the commit tree
won't get re-allocated, as they are pinned down and won't be re-used
until the next transaction.
So that explanation only works for the nodatacow case, and I could not
find a good explanation until now.
But if it's a prealloc extent, then it's indeed a different story.
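To make the ordering concrete, here is a minimal userspace sketch of the
interleaving only (plain C, nothing btrfs-specific, all names made up):
the replace worker latches the stale on-disk content of the prealloc
range, write duplication then lands the new data on the target, and the
stale copy is finally written back over it.

#include <stdio.h>
#include <string.h>

int main(void)
{
	/*
	 * "target" stands for the replace target device, "stale" for the
	 * commit-root / on-disk view the replace worker reads from.
	 */
	char target[16] = "PREALLOC_ZEROES";
	char stale[16];

	/* replace worker reads the prealloc range from the source disk */
	memcpy(stale, target, sizeof(stale));

	/* 1) write duplication lands the new data on the target */
	memcpy(target, "NEW_USER_DATA__", sizeof(target));

	/* 2) scrub write for dev-replace copies the stale bytes back */
	memcpy(target, stale, sizeof(target));

	/* the target now holds stale data although the user wrote new data */
	printf("target after replace: %.15s\n", target);
	return 0;
}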
>
>>
>> In theory, this should only affect data, not metadata.
>> Metadata writeback only happens when committing a transaction, thus it
>> always happens after the scrub writes.
>
> No, not only when committing a transaction.
> It can happen under memory pressure; tree extents can be written out
> before that. In fact, if you remember the 5.2 corruption and deadlock,
> the deadlock case happened precisely when writeback of the btree inode
> was triggered before a transaction commit.
>
>>
>> [FIX]
>> Make dev-replace require the target block group to be marked RO.
>>
>> And to be extra safe, for dev-replace, wait for all existing writes to
>> finish before scrubbing the chunk.
>>
>> This patch mostly reverts commit 76a8efa171bf ("btrfs: Continue replace
>> when set_block_ro failed").
>> Hitting ENOSPC during dev-replace is still much better than data corruption.
>>
>> Reported-by: Filipe Manana <fdmanana@xxxxxxxx>
>> Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
>> Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
>> Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
>> ---
>> Not concretely confirmed, mostly a guess, thus the RFC tag.
>
> Well, it's better to confirm...
> IIRC, dev-replace does not skip copies for prealloc extents, it copies
> what is on disk.
That's true, it doesn't do a backref walk to determine whether an extent
is preallocated or regular.
It just gathers the csums, copies the pages from the source disk,
verifies them when a csum exists, then writes the pages to the target.
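In rough pseudo-C, my understanding of the per-block flow is the
following (the helper names are invented for illustration, they are not
the real scrub_* functions):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct block {
	uint64_t logical;
	void *pages;         /* data read from the source device */
	bool has_csum;       /* csum found in the (commit root) csum tree */
};

/* hypothetical stand-ins for the real helpers */
static bool csum_matches(const struct block *b) { (void)b; return true; }
static void write_to_target(const struct block *b) { (void)b; }

static void replace_one_block(struct block *b)
{
	/*
	 * No backref walk here: a preallocated range simply has no csum,
	 * so whatever bytes are on the source disk get copied to the
	 * target, even if write duplication already landed newer data
	 * there.
	 */
	if (b->has_csum && !csum_matches(b))
		return;                 /* the real code would try a repair */

	write_to_target(b);
}

int main(void)
{
	struct block prealloc = { .logical = 0, .pages = NULL, .has_csum = false };

	replace_one_block(&prealloc);
	return 0;
}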
So prealloc indeed looks like a very valid cause, and it can be verified
just by disabling prealloc in fsstress.
Thanks again for pointing out the missing piece.
Qu
> If that's the case, then this is correct. However if it's smart and
> skips copying prealloc extents (which is pointless), then the problem
> must have another technical explanation.
>
>>
>> My first guess was a race at the dev-replace starting point, but the
>> related code is in fact very safe.
>> ---
>> fs/btrfs/scrub.c | 35 ++++++++++++++++++++++++++++++++---
>> 1 file changed, 32 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>> index 21de630b0730..69e76a4d1258 100644
>> --- a/fs/btrfs/scrub.c
>> +++ b/fs/btrfs/scrub.c
>> @@ -3472,6 +3472,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>> struct btrfs_path *path;
>> struct btrfs_fs_info *fs_info = sctx->fs_info;
>> struct btrfs_root *root = fs_info->dev_root;
>> + bool is_dev_replace = sctx->is_dev_replace;
>
> Not needed, just use sctx->is_dev_replace like everywhere else.
>
> Thanks.
>
>> u64 length;
>> u64 chunk_offset;
>> int ret = 0;
>> @@ -3577,17 +3578,35 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>> * This can easily boost the amount of SYSTEM chunks if cleaner
>> * thread can't be triggered fast enough, and use up all space
>> * of btrfs_super_block::sys_chunk_array
>> + *
>> + *
>> + * On the other hand, try our best to mark block group RO for
>> + * dev-replace case.
>> + *
>> + * Dev-replace has two types of write:
>> + * - Write duplication
>> + * New write will be written to both target and source device
>> + * The content is always the *newest* data.
>> + * - Scrub write for dev-replace
>> + * Scrub will write the verified data for dev-replace.
>> + * The data and its csum are all from *commit* root, which
>> + * is not the newest version.
>> + *
>> + * If scrub write happens after write duplication, we would
>> + * cause data corruption.
>> + * So we need to try our best to mark block group RO, and exit
>> + * out if we don't have enough space.
>> */
>> - ret = btrfs_inc_block_group_ro(cache, false);
>> + ret = btrfs_inc_block_group_ro(cache, is_dev_replace);
>> scrub_pause_off(fs_info);
>>
>> if (ret == 0) {
>> ro_set = 1;
>> - } else if (ret == -ENOSPC) {
>> + } else if (ret == -ENOSPC && !is_dev_replace) {
>> /*
>> * btrfs_inc_block_group_ro return -ENOSPC when it
>> * failed in creating new chunk for metadata.
>> - * It is not a problem for scrub/replace, because
>> + * It is not a problem for scrub, because
>> * metadata are always cowed, and our scrub paused
>> * commit_transactions.
>> */
>> @@ -3605,6 +3624,16 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>> dev_replace->item_needs_writeback = 1;
>> up_write(&dev_replace->rwsem);
>>
>> + /*
>> + * Also wait for any existing writes to prevent races between
>> + * write duplication and scrub writes.
>> + */
>> + if (is_dev_replace) {
>> + btrfs_wait_block_group_reservations(cache);
>> + btrfs_wait_nocow_writers(cache);
>> + btrfs_wait_ordered_roots(fs_info, U64_MAX,
>> + cache->start, cache->length);
>> + }
>> ret = scrub_chunk(sctx, scrub_dev, chunk_offset, length,
>> found_key.offset, cache);
>>
>> --
>> 2.25.0
>>
>
>