Re: RAID5/6 permanent corruption of metadata and data extents

On 2020/4/2 7:08 PM, Filipe Manana wrote:
> Hi,
> 
> Recently I was looking at why the test case btrfs/125 from fstests often fails.
> Typically when it fails we have something like the following in dmesg/syslog:
> 
>  (...)
>  BTRFS error (device sdc): space cache generation (7) does not match inode (9)
>  BTRFS warning (device sdc): failed to load free space cache for block
> group 38797312, rebuilding it now
>  BTRFS info (device sdc): balance: start -d -m -s
>  BTRFS info (device sdc): relocating block group 754581504 flags data|raid5
>  BTRFS error (device sdc): bad tree block start, want 39059456 have 0
>  BTRFS info (device sdc): read error corrected: ino 0 off 39059456
> (dev /dev/sde sector 18688)
>  BTRFS info (device sdc): read error corrected: ino 0 off 39063552
> (dev /dev/sde sector 18696)
>  BTRFS info (device sdc): read error corrected: ino 0 off 39067648
> (dev /dev/sde sector 18704)
>  BTRFS info (device sdc): read error corrected: ino 0 off 39071744
> (dev /dev/sde sector 18712)
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1376256
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1380352
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1445888
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1384448
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1388544
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1392640
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1396736
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1400832
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1404928
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1409024
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS info (device sdc): read error corrected: ino 257 off 1380352
> (dev /dev/sde sector 718728)
>  BTRFS info (device sdc): read error corrected: ino 257 off 1376256
> (dev /dev/sde sector 718720)
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS info (device sdc): balance: ended with status: -5
>  (...)
> 
> So I finally looked into it to figure out why that happens.
> 
> Consider the following scenario and steps, which explain how we end up with a
> metadata extent that is permanently corrupted and unrecoverable (when that
> shouldn't be possible).
> 
> * We have a RAID5 filesystem consisting of three devices, with device
> IDs of 1, 2 and 3;
> 
> * The filesystem's nodesize is 16Kb (the default of mkfs.btrfs);
> 
> * We have a single metadata block group that starts at logical offset
> 38797312 and has a
>   length of 715784192 bytes.
> 
> The following steps lead to a permanent corruption of a metadata extent:
> 
> 1) We make device 3 unavailable and mount the filesystem in degraded
> mode, so only
>    devices 1 and 2 are online;
> 
> 2) We allocate a new extent buffer with logical address of 39043072, this falls
>    within the full stripe that starts at logical address 38928384, which is
>    composed of 3 stripes, each with a size of 64Kb:
> 
>    [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ] [
> stripe 3, offset 39059456 ]
>    (the offsets are logical addresses)
> 
>    stripe 1 is in device 2
>    stripe 2 is in device 3
>    stripe 3 is in device 1  (this is the parity stripe)
> 
>    Our extent buffer 39043072 falls into stripe 2, starting at page
> with index 12
>    of that stripe and ending at page with index 15;
> 
> 3) When writing the new extent buffer at address 39043072 we obviously
> don't write
>    the second stripe since device 3 is missing and we are in degraded
> mode. We write
>    only the stripes for devices 1 and 2, which are enough to recover
> stripe 2 content
>    when it's needed to read it (by XORing stripes 1 and 3, we produce
> the correct
>    content of stripe 2);
> 
> 4) We unmount the filesystem;
> 
> 5) We make device 3 available and then mount the filesystem in
> non-degraded mode;
> 
> 6) Due to some write operation (such as relocation like btrfs/125
> does), we allocate
>    a new extent buffer at logical address 38993920. This belongs to
> the same full
>    stripe as the extent buffer we allocated before in degraded mode (39043072),
>    and it's mapped to stripe 2 of that full stripe as well,
> corresponding to page
>    indexes from 0 to 3 of that stripe;
> 
> 7) When we do the actual write of this stripe, because it's a partial
> stripe write
>    (we aren't writing to all the pages of all the stripes of the full
> stripe), we
>    need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
>    all the pages of stripe 1 from disk in order to compute the content for the
>    parity stripe. So we submit bios to read those pages from the corresponding
>    devices (we do this at raid56.c:raid56_rmw_stripe()). The problem is that we
>    assume whatever we read from the devices is valid - in this case what we read
>    from device 3, to which stripe 2 is mapped, is invalid since in the degraded
>    mount we haven't written extent buffer 39043072 to it - so we get
> garbage from
>    that device (either a stale extent, a bunch of zeroes due to trim/discard or
>    anything completely random). Then we compute the content for the
> parity stripe
>    based on that invalid content we read from device 3 and write the
> parity stripe
>    (and the other two stripes) to disk;
> 
> 8) We later try to read extent buffer 39043072 (the one we allocated while in
>    degraded mode), but what we get from device 3 is invalid (this extent buffer
>    belongs to a stripe of device 3, remember step 2), so
> btree_read_extent_buffer_pages()
>    triggers a recovery attempt - this happens through:
> 
>    btree_read_extent_buffer_pages() -> read_extent_buffer_pages() ->
>      -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() ->
>        -> raid56_parity_recover()
> 
>    This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the parity
>    stripe) by XORing the content of these last two. However the parity
> stripe was
>    recomputed at step 7 using invalid content from device 3 for stripe 2, so the
>    rebuilt stripe 2 still has invalid content for the extent buffer 39043072.
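Just to make the numbers in steps 2 and 6 easier to follow, the address
arithmetic can be checked with a tiny standalone sketch (plain userspace C,
not btrfs code; the 64Kb stripe length, 4Kb page size, 16Kb nodesize and
full stripe start are the values from the report above):

/*
 * Standalone sketch (not kernel code): map a logical address to its data
 * stripe and page range inside the full stripe described in step 2.
 * Build with: gcc -o map map.c
 */
#include <stdio.h>

#define STRIPE_LEN	(64 * 1024)		/* current fixed stripe length */
#define PAGE_SZ		4096
#define NODESIZE	(16 * 1024)		/* mkfs.btrfs default */
#define FULL_STRIPE	38928384ULL		/* full stripe start (step 2) */

static void map_eb(unsigned long long logical)
{
	unsigned long long off = logical - FULL_STRIPE;
	unsigned int stripe = off / STRIPE_LEN;			/* 0 based */
	unsigned int first_page = (off % STRIPE_LEN) / PAGE_SZ;
	unsigned int last_page = first_page + NODESIZE / PAGE_SZ - 1;

	printf("eb %llu -> data stripe %u, pages %u..%u\n",
	       logical, stripe + 1, first_page, last_page);
}

int main(void)
{
	map_eb(39043072ULL);	/* step 2: data stripe 2, pages 12..15 */
	map_eb(38993920ULL);	/* step 6: data stripe 2, pages 0..3 */
	return 0;
}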
> 
> This makes it impossible to recover the extent buffer, resulting in permanent
> metadata corruption. If the read of extent buffer 39043072 had happened before
> the write of extent buffer 38993920, we would have been able to recover it,
> since the parity stripe still reflected the correct content - it matched what
> was written in degraded mode at steps 2 and 3.
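The sequence of steps 3, 7 and 8 can also be modelled in userspace with three
byte arrays standing in for the devices. This is only a sketch of the XOR
math, not the raid56 code, but it shows why the rebuild hands back the stale
bytes:

/*
 * Toy model of steps 3, 7 and 8: two data stripes + one parity stripe,
 * 16 pages of 4Kb each. Build with: gcc -o writehole writehole.c
 */
#include <stdio.h>
#include <string.h>

#define PAGES		16
#define PAGE_SZ		4096
#define STRIPE_SZ	(PAGES * PAGE_SZ)

static unsigned char dev2[STRIPE_SZ];	/* data stripe 1 (device 2) */
static unsigned char dev3[STRIPE_SZ];	/* data stripe 2 (device 3) */
static unsigned char dev1[STRIPE_SZ];	/* parity        (device 1) */

static void xor_parity(unsigned char *p, const unsigned char *a,
		       const unsigned char *b)
{
	for (int i = 0; i < STRIPE_SZ; i++)
		p[i] = a[i] ^ b[i];
}

int main(void)
{
	unsigned char eb_39043072[4 * PAGE_SZ];	/* intended, pages 12..15 */
	unsigned char eb_38993920[4 * PAGE_SZ];	/* later write, pages 0..3 */
	unsigned char stripe2_intended[STRIPE_SZ] = { 0 };
	unsigned char stripe2_rmw[STRIPE_SZ];
	unsigned char rebuilt[STRIPE_SZ];

	memset(eb_39043072, 0xaa, sizeof(eb_39043072));
	memset(eb_38993920, 0xbb, sizeof(eb_38993920));
	memset(dev3, 0x77, sizeof(dev3));	/* stale garbage on device 3 */

	/*
	 * Step 3: degraded write of eb 39043072. Device 3 is missing, so
	 * only stripe 1 and the parity hit disk; the parity is computed
	 * from the intended stripe 2 content, so at this point
	 * dev2 ^ dev1 still gives back the extent buffer.
	 */
	memcpy(stripe2_intended + 12 * PAGE_SZ, eb_39043072,
	       sizeof(eb_39043072));
	xor_parity(dev1, dev2, stripe2_intended);

	/*
	 * Step 7: partial stripe write of eb 38993920 (pages 0..3 of
	 * stripe 2). The RMW reads pages 4..15 of stripe 2 from device 3
	 * and trusts them, but device 3 still holds the stale garbage.
	 */
	memcpy(stripe2_rmw, dev3, STRIPE_SZ);		/* stale pages 4..15 */
	memcpy(stripe2_rmw, eb_38993920, sizeof(eb_38993920));
	xor_parity(dev1, dev2, stripe2_rmw);		/* parity now poisoned */
	memcpy(dev3, eb_38993920, sizeof(eb_38993920));	/* changed pages only */

	/*
	 * Step 8: reading eb 39043072 from device 3 fails, so stripe 2 is
	 * rebuilt from stripe 1 and the parity - which was computed from
	 * the stale content, so the rebuild reproduces the garbage.
	 */
	xor_parity(rebuilt, dev2, dev1);
	if (memcmp(rebuilt + 12 * PAGE_SZ, eb_39043072, sizeof(eb_39043072)))
		printf("eb 39043072 lost: rebuild returned stale bytes\n");
	else
		printf("eb 39043072 recovered\n");
	return 0;
}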
> 
> The same type of issue happens for data extents as well.
> 
> Since the stripe size is currently fixed at 64Kb, the only case where the issue
> doesn't happen is when the node size and sector size are both 64Kb (systems
> with a 64Kb page size).
> 
> And we don't need to do writes in degraded mode and then mount in non-degraded
> mode with the previously missing device for this to happen (I gave the example
> of degraded mode because that's what btrfs/125 exercises).

This also means other raid5/6 implementations are affected by the same
problem, right?

> 
> Any scenario where the on disk content for an extent changed (some bit flips for
> example) can result in a permanently unrecoverable metadata or data extent if we
> have the bad luck of having a partial stripe write happen before an attempt to
> read and recover a corrupt extent in the same stripe.
> 
> Zygo had a report some months ago where he experienced this as well:
> 
> https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@xxxxxxxxxxxxxx/
> 
> Haven't tried his script to reproduce, but it's very likely it's due to this
> issue caused by partial stripe writes before reads and recovery attempts.
> 
> This is a problem that has been around since raid5/6 support was added, and it
> seems to me it's something that was not thought about in the initial design.
> 
> The validation/check of an extent (both metadata and data) happens at a higher
> layer than the raid5/6 layer, and it's the higher layer that orders the lower
> layer (raid56.{c,h}) to attempt recover/repair after it reads an extent that
> fails validation.
> 
> I'm not seeing a reasonable way to fix this at the moment, initial thoughts all
> imply:
> 
> 1) Attempts to validate all extents of a stripe before doing a partial write,
> which would not only be a performance killer and terribly complex, but would
> also be very messy to organize with respect to proper layering of
> responsibilities;

Yes, this means the raid56 layer would have to rely on the extent tree to do
verification, which is too complex.

Not really worth it in my opinion either.

> 
> 2) Maybe changing the allocator to work in a special way for raid5/6 such that
> it never allocates an extent from a stripe that already has extents that were
> allocated by past transactions. However data extent allocation is currently
> done without holding a transaction open (and for good reasons) during
> writeback. Would need more thought to see how viable it is, but not trivial
> either.
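For what it's worth, a very rough sketch of the check idea 2) implies is
below. full_stripe_has_committed_extents() is a made-up stub, not an existing
btrfs function, and the hard part (the extent tree lookup without a
transaction) is exactly what it hand-waves away:

/*
 * Illustrative only: the kind of filter idea 2) would add before handing
 * out a raid5/6 extent.
 */
#include <stdbool.h>
#include <stdint.h>

#define RAID_STRIPE_LEN	(64ULL * 1024)

/* Hypothetical: any extent from a past transaction in [start, start+len)? */
static bool full_stripe_has_committed_extents(uint64_t start, uint64_t len)
{
	(void)start;
	(void)len;
	return false;		/* stub */
}

/*
 * start/len are offsets relative to the chunk start, which is assumed to
 * be aligned to the full stripe length.
 */
static bool candidate_ok_for_raid56(uint64_t start, uint64_t len,
				    int nr_data_stripes)
{
	uint64_t full_len = RAID_STRIPE_LEN * nr_data_stripes;
	uint64_t fs_start = start / full_len * full_len;
	uint64_t fs_end = (start + len + full_len - 1) / full_len * full_len;

	/*
	 * Only allocate from full stripes with no extents committed by
	 * earlier transactions, so every write there is a full stripe
	 * write and the parity never depends on old on-disk data.
	 */
	return !full_stripe_has_committed_extents(fs_start, fs_end - fs_start);
}

int main(void)
{
	/* e.g. the 16Kb eb from step 6, 2 data stripes */
	return candidate_ok_for_raid56(38993920ULL - 38797312ULL, 16384, 2) ? 0 : 1;
}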
> 
> Any thoughts? Perhaps someone else was already aware of this problem and
> had thought about this before. Josef?

What about using the sector size as the device stripe size?

It would make metadata scrubbing suffer and would cause performance problems,
I guess, but it looks a little more feasible.
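The arithmetic behind that, as a rough standalone sketch (offsets taken from
the report above; note that a single 4Kb data write would still be a partial
write with two data stripes, so this mostly helps writes that are a multiple
of the full stripe, like metadata):

/*
 * With nr_data data stripes a full stripe is nr_data * elem bytes, and a
 * write only has to trust old on-disk data (read-modify-write) when it
 * isn't full stripe aligned. Not btrfs code; offsets are relative to the
 * chunk start.
 */
#include <stdio.h>
#include <stdint.h>

static int needs_rmw(uint64_t off, uint64_t len, uint64_t elem, int nr_data)
{
	uint64_t full = elem * nr_data;

	return (off % full) || (len % full);
}

int main(void)
{
	/* the 16Kb eb write at 39043072 from step 2, chunk starts at 38797312 */
	uint64_t off = 39043072ULL - 38797312ULL;

	printf("64Kb elements: rmw=%d\n", needs_rmw(off, 16384, 65536, 2));
	printf(" 4Kb elements: rmw=%d\n", needs_rmw(off, 16384, 4096, 2));
	return 0;
}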

Thanks,
Qu

> 
> Thanks.
> 
> 


