Re: RAID5/6 permanent corruption of metadata and data extents

On Fri, Apr 3, 2020 at 11:12 AM Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote:
>
>
>
> > On 2020/4/3 6:04 PM, Filipe Manana wrote:
> > On Fri, Apr 3, 2020 at 1:00 AM Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote:
> >>
> >>
> >>
> >> On 2020/4/2 9:26 PM, Filipe Manana wrote:
> >>> On Thu, Apr 2, 2020 at 1:43 PM Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 2020/4/2 8:33 PM, Filipe Manana wrote:
> >>>>> On Thu, Apr 2, 2020 at 12:55 PM Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 2020/4/2 7:08 PM, Filipe Manana wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> Recently I was looking at why the test case btrfs/125 from fstests often fails.
> >>>>>>> Typically when it fails we have something like the following in dmesg/syslog:
> >>>>>>>
> >>>>>>>  (...)
> >>>>>>>  BTRFS error (device sdc): space cache generation (7) does not match inode (9)
> >>>>>>>  BTRFS warning (device sdc): failed to load free space cache for block
> >>>>>>> group 38797312, rebuilding it now
> >>>>>>>  BTRFS info (device sdc): balance: start -d -m -s
> >>>>>>>  BTRFS info (device sdc): relocating block group 754581504 flags data|raid5
> >>>>>>>  BTRFS error (device sdc): bad tree block start, want 39059456 have 0
> >>>>>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39059456
> >>>>>>> (dev /dev/sde sector 18688)
> >>>>>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39063552
> >>>>>>> (dev /dev/sde sector 18696)
> >>>>>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39067648
> >>>>>>> (dev /dev/sde sector 18704)
> >>>>>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39071744
> >>>>>>> (dev /dev/sde sector 18712)
> >>>>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1376256
> >>>>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1380352
> >>>>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1445888
> >>>>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1384448
> >>>>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1388544
> >>>>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1392640
> >>>>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1396736
> >>>>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1400832
> >>>>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1404928
> >>>>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1409024
> >>>>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>>>  BTRFS info (device sdc): read error corrected: ino 257 off 1380352
> >>>>>>> (dev /dev/sde sector 718728)
> >>>>>>>  BTRFS info (device sdc): read error corrected: ino 257 off 1376256
> >>>>>>> (dev /dev/sde sector 718720)
> >>>>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>>>  BTRFS info (device sdc): balance: ended with status: -5
> >>>>>>>  (...)
> >>>>>>>
> >>>>>>> So I finally looked into it to figure out why that happens.
> >>>>>>>
> >>>>>>> Consider the following scenario and steps that explain how we end up
> >>>>>>> with a metadata extent that is permanently corrupted and unrecoverable
> >>>>>>> (when that shouldn't be possible).
> >>>>>>>
> >>>>>>> * We have a RAID5 filesystem consisting of three devices, with device
> >>>>>>>   IDs 1, 2 and 3;
> >>>>>>>
> >>>>>>> * The filesystem's nodesize is 16Kb (the default of mkfs.btrfs);
> >>>>>>>
> >>>>>>> * We have a single metadata block group that starts at logical offset
> >>>>>>>   38797312 and has a length of 715784192 bytes.
> >>>>>>>
> >>>>>>> The following steps lead to a permanent corruption of a metadata extent:
> >>>>>>>
> >>>>>>> 1) We make device 3 unavailable and mount the filesystem in degraded
> >>>>>>>    mode, so only devices 1 and 2 are online;
> >>>>>>>
> >>>>>>> 2) We allocate a new extent buffer with a logical address of 39043072.
> >>>>>>>    This falls within the full stripe that starts at logical address
> >>>>>>>    38928384, which is composed of 3 stripes, each with a size of 64Kb:
> >>>>>>>
> >>>>>>>    [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ]
> >>>>>>>    [ stripe 3, offset 39059456 ]
> >>>>>>>    (the offsets are logical addresses)
> >>>>>>>
> >>>>>>>    stripe 1 is on device 2
> >>>>>>>    stripe 2 is on device 3
> >>>>>>>    stripe 3 is on device 1 (this is the parity stripe)
> >>>>>>>
> >>>>>>>    Our extent buffer 39043072 falls into stripe 2, starting at the page
> >>>>>>>    with index 12 of that stripe and ending at the page with index 15;
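> >>>>>>>
> >>>>>>>    (As a minimal user-space sketch of this mapping - the constants come
> >>>>>>>    from this example and the math is simplified; it's an illustration,
> >>>>>>>    not actual btrfs code:)
> >>>>>>>
> >>>>>>>    #include <stdio.h>
> >>>>>>>    #include <stdint.h>
> >>>>>>>
> >>>>>>>    #define STRIPE_LEN (64 * 1024ULL)  /* fixed 64Kb stripe size */
> >>>>>>>    #define PAGE_SZ    4096ULL
> >>>>>>>
> >>>>>>>    int main(void)
> >>>>>>>    {
> >>>>>>>            uint64_t full_stripe_start = 38928384;
> >>>>>>>            uint64_t eb_logical = 39043072;  /* our extent buffer */
> >>>>>>>            uint64_t eb_len = 16 * 1024;     /* nodesize */
> >>>>>>>
> >>>>>>>            uint64_t off = eb_logical - full_stripe_start;
> >>>>>>>            uint64_t stripe_nr = off / STRIPE_LEN;  /* 0-based */
> >>>>>>>            uint64_t first_page = (off % STRIPE_LEN) / PAGE_SZ;
> >>>>>>>            uint64_t last_page = first_page + eb_len / PAGE_SZ - 1;
> >>>>>>>
> >>>>>>>            /* prints "stripe 2, pages 12 to 15" */
> >>>>>>>            printf("stripe %llu, pages %llu to %llu\n",
> >>>>>>>                   (unsigned long long)(stripe_nr + 1),
> >>>>>>>                   (unsigned long long)first_page,
> >>>>>>>                   (unsigned long long)last_page);
> >>>>>>>            return 0;
> >>>>>>>    }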
> >>>>>>>
> >>>>>>> 3) When writing the new extent buffer at address 39043072 we obviously
> >>>>>>>    don't write the second stripe, since device 3 is missing and we are
> >>>>>>>    in degraded mode. We write only the stripes for devices 1 and 2,
> >>>>>>>    which are enough to recover the content of stripe 2 when it's needed
> >>>>>>>    (by XORing stripes 1 and 3 we produce the correct content of
> >>>>>>>    stripe 2);
> >>>>>>>
> >>>>>>> 4) We unmount the filesystem;
> >>>>>>>
> >>>>>>> 5) We make device 3 available and then mount the filesystem in
> >>>>>>> non-degraded mode;
> >>>>>>>
> >>>>>>> 6) Due to some write operation (such as relocation, like btrfs/125
> >>>>>>>    does), we allocate a new extent buffer at logical address 38993920.
> >>>>>>>    This belongs to the same full stripe as the extent buffer we
> >>>>>>>    allocated before in degraded mode (39043072), and it's mapped to
> >>>>>>>    stripe 2 of that full stripe as well, corresponding to page indexes
> >>>>>>>    from 0 to 3 of that stripe;
> >>>>>>>
> >>>>>>> 7) When we do the actual write of this stripe, because it's a partial
> >>>>>>>    stripe write (we aren't writing to all the pages of all the stripes
> >>>>>>>    of the full stripe), we need to read the remaining pages of stripe 2
> >>>>>>>    (page indexes from 4 to 15) and all the pages of stripe 1 from disk,
> >>>>>>>    in order to compute the content for the parity stripe. So we submit
> >>>>>>>    bios to read those pages from the corresponding devices (we do this
> >>>>>>>    at raid56.c:raid56_rmw_stripe()). The problem is that we assume
> >>>>>>>    whatever we read from the devices is valid - in this case what we
> >>>>>>>    read from device 3, to which stripe 2 is mapped, is invalid, since in
> >>>>>>>    the degraded mount we haven't written extent buffer 39043072 to it -
> >>>>>>>    so we get garbage from that device (either a stale extent, a bunch of
> >>>>>>>    zeroes due to trim/discard, or anything completely random). Then we
> >>>>>>>    compute the content for the parity stripe based on that invalid
> >>>>>>>    content we read from device 3, and write the parity stripe (and the
> >>>>>>>    other two stripes) to disk;
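> >>>>>>>
> >>>>>>>    (A rough user-space sketch of that read-modify-write parity
> >>>>>>>    computation for this 2-data-stripe layout - a simplification with
> >>>>>>>    made-up names, not the actual raid56.c code; note that nothing here
> >>>>>>>    validates what was read back from disk:)
> >>>>>>>
> >>>>>>>    #include <stddef.h>
> >>>>>>>    #include <stdint.h>
> >>>>>>>    #include <string.h>
> >>>>>>>
> >>>>>>>    #define STRIPE_LEN (64 * 1024)
> >>>>>>>
> >>>>>>>    /* The pages of the full stripe we are not writing are read back
> >>>>>>>     * from the devices and used as-is - no csum or tree block checks
> >>>>>>>     * happen at this layer - so stale content read from a previously
> >>>>>>>     * missing device silently poisons the new parity. */
> >>>>>>>    static void rmw_recompute_parity(const uint8_t *stripe1,
> >>>>>>>                                     const uint8_t *stripe2,
> >>>>>>>                                     uint8_t *parity)
> >>>>>>>    {
> >>>>>>>            for (size_t i = 0; i < STRIPE_LEN; i++)
> >>>>>>>                    parity[i] = stripe1[i] ^ stripe2[i];
> >>>>>>>    }
> >>>>>>>
> >>>>>>>    int main(void)
> >>>>>>>    {
> >>>>>>>            static uint8_t s1[STRIPE_LEN], s2[STRIPE_LEN], p[STRIPE_LEN];
> >>>>>>>
> >>>>>>>            memset(s1, 0xAA, sizeof(s1));  /* stripe 1, valid content  */
> >>>>>>>            memset(s2, 0x00, sizeof(s2));  /* stale read from device 3 */
> >>>>>>>            rmw_recompute_parity(s1, s2, p); /* parity now from garbage */
> >>>>>>>            return 0;
> >>>>>>>    }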
> >>>>>>>
> >>>>>>> 8) We later try to read extent buffer 39043072 (the one we allocated
> >>>>>>>    while in degraded mode), but what we get from device 3 is invalid
> >>>>>>>    (this extent buffer belongs to a stripe of device 3, remember step
> >>>>>>>    2), so btree_read_extent_buffer_pages() triggers a recovery attempt -
> >>>>>>>    this happens through:
> >>>>>>>
> >>>>>>>    btree_read_extent_buffer_pages() -> read_extent_buffer_pages() ->
> >>>>>>>      -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() ->
> >>>>>>>        -> raid56_parity_recover()
> >>>>>>>
> >>>>>>>    This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the
> >>>>>>>    parity stripe) by XORing the content of these last two. However, the
> >>>>>>>    parity stripe was recomputed at step 7 using invalid content from
> >>>>>>>    device 3 for stripe 2, so the rebuilt stripe 2 still has invalid
> >>>>>>>    content for the extent buffer 39043072.
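> >>>>>>>
> >>>>>>>    (To make the failure concrete, a tiny user-space sketch - one byte
> >>>>>>>    per stripe for brevity - showing that once the parity has been
> >>>>>>>    recomputed from garbage, XOR recovery faithfully returns that same
> >>>>>>>    garbage:)
> >>>>>>>
> >>>>>>>    #include <assert.h>
> >>>>>>>    #include <stdint.h>
> >>>>>>>
> >>>>>>>    int main(void)
> >>>>>>>    {
> >>>>>>>            uint8_t d1 = 0xAA;         /* stripe 1, on device 2        */
> >>>>>>>            uint8_t d2 = 0x55;         /* stripe 2, never hit device 3 */
> >>>>>>>            uint8_t parity = d1 ^ d2;  /* written in degraded mode     */
> >>>>>>>
> >>>>>>>            /* step 7: the partial write reads garbage from device 3
> >>>>>>>             * and recomputes the parity from it */
> >>>>>>>            uint8_t d2_disk = 0x00;
> >>>>>>>            parity = d1 ^ d2_disk;
> >>>>>>>
> >>>>>>>            /* step 8: recovery rebuilds stripe 2 from stripe 1 and
> >>>>>>>             * the new parity, and gets the garbage back */
> >>>>>>>            uint8_t rebuilt = d1 ^ parity;
> >>>>>>>            assert(rebuilt == d2_disk);
> >>>>>>>            assert(rebuilt != d2);
> >>>>>>>            return 0;
> >>>>>>>    }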
> >>>>>>>
> >>>>>>> This makes it impossible to recover the extent buffer, resulting in
> >>>>>>> permanent metadata corruption. If the read of extent buffer 39043072 had
> >>>>>>> happened before the write of extent buffer 38993920, we would have been
> >>>>>>> able to recover it, since the parity stripe still reflected the correct
> >>>>>>> content - it matched what was written in degraded mode at steps 2 and 3.
> >>>>>>>
> >>>>>>> The same type of issue happens for data extents as well.
> >>>>>>>
> >>>>>>> Since the stripe size is currently fixed at 64Kb, the only case in which
> >>>>>>> the issue doesn't happen is when the node size and sector size are 64Kb
> >>>>>>> (systems with a 64Kb page size).
> >>>>>>>
> >>>>>>> And we don't need to do writes in degraded mode and then mount in non-degraded
> >>>>>>> mode with the previously missing device for this to happen (I gave the example
> >>>>>>> of degraded mode because that's what btrfs/125 exercises).
> >>>>>>
> >>>>>> This also means other raid5/6 implementations are affected by the
> >>>>>> same problem, right?
> >>>>>
> >>>>> If so, that makes them less useful as well.
> >>>>> For all the other raid modes we support, which use mirrors, we don't
> >>>>> have this problem. If one copy is corrupt, we are able to recover it,
> >>>>> period.
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> Any scenario where the on disk content for an extent changed (some bit flips for
> >>>>>>> example) can result in a permanently unrecoverable metadata or data extent if we
> >>>>>>> have the bad luck of having a partial stripe write happen before an attempt to
> >>>>>>> read and recover a corrupt extent in the same stripe.
> >>>>>>>
> >>>>>>> Zygo had a report some months ago where he experienced this as well:
> >>>>>>>
> >>>>>>> https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@xxxxxxxxxxxxxx/
> >>>>>>>
> >>>>>>> Haven't tried his script to reproduce, but it's very likely it's due to this
> >>>>>>> issue caused by partial stripe writes before reads and recovery attempts.
> >>>>>>>
> >>>>>>> This is a problem that has been around since raid5/6 support was added, and it
> >>>>>>> seems to me it's something that was not thought about in the initial design.
> >>>>>>>
> >>>>>>> The validation/check of an extent (both metadata and data) happens at a higher
> >>>>>>> layer than the raid5/6 layer, and it's the higher layer that orders the lower
> >>>>>>> layer (raid56.{c,h}) to attempt recover/repair after it reads an extent that
> >>>>>>> fails validation.
> >>>>>>>
> >>>>>>> I'm not seeing a reasonable way to fix this at the moment; initial
> >>>>>>> thoughts all imply:
> >>>>>>>
> >>>>>>> 1) Attempts to validate all extents of a stripe before doing a partial
> >>>>>>>    write, which would not only be a performance killer and terribly
> >>>>>>>    complex, but would also be very messy to organize in respect to
> >>>>>>>    proper layering of responsibilities;
> >>>>>>
> >>>>>> Yes, this would mean the raid56 layer relies on the extent tree to do
> >>>>>> verification, which is too complex.
> >>>>>>
> >>>>>> Not really worth it to me either.
> >>>>>>
> >>>>>>>
> >>>>>>> 2) Maybe changing the allocator to work in a special way for raid5/6,
> >>>>>>>    such that it never allocates an extent from a stripe that already has
> >>>>>>>    extents allocated by past transactions. However, data extent
> >>>>>>>    allocation is currently done without holding a transaction open (and
> >>>>>>>    for good reasons) during writeback. This would need more thought to
> >>>>>>>    see how viable it is, but it's not trivial either.
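> >>>>>>>
> >>>>>>>    (A very rough sketch of what such an allocator-side rule could look
> >>>>>>>    like - purely hypothetical, every name below is made up, and it
> >>>>>>>    leaves open how the lookup would actually be answered:)
> >>>>>>>
> >>>>>>>    #include <stdbool.h>
> >>>>>>>    #include <stdint.h>
> >>>>>>>
> >>>>>>>    /* Stands in for whatever lookup (extent tree, a cached bitmap,
> >>>>>>>     * etc.) answers "does this full stripe contain extents allocated
> >>>>>>>     * by past transactions?". */
> >>>>>>>    typedef bool (*has_old_extents_fn)(uint64_t fs_start,
> >>>>>>>                                       uint64_t fs_len);
> >>>>>>>
> >>>>>>>    /* Only allow allocating from a full stripe that is empty or only
> >>>>>>>     * holds extents from the current transaction, so a partial write
> >>>>>>>     * never has to trust unvalidated old stripe content for parity. */
> >>>>>>>    static bool can_alloc_from_full_stripe(uint64_t fs_start,
> >>>>>>>                                           uint64_t fs_len,
> >>>>>>>                                           has_old_extents_fn has_old)
> >>>>>>>    {
> >>>>>>>            return !has_old(fs_start, fs_len);
> >>>>>>>    }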
> >>>>>>>
> >>>>>>> Any thoughts? Perhaps someone else was already aware of this problem and
> >>>>>>> had thought about this before. Josef?
> >>>>>>
> >>>>>> What about using the sector size as the device stripe size?
> >>>>>
> >>>>> Unfortunately that wouldn't work either.
> >>>>>
> >>>>> Say you have stripe 1 with a corrupt metadata extent. Then you do a
> >>>>> write to a metadata extent located at stripe 2 - this partial write
> >>>>> (because it doesn't cover all stripes of the full stripe) will read
> >>>>> the pages from the first stripe, assume they are all good, and then
> >>>>> use them for computing the parity stripe - based on a corrupt extent
> >>>>> as well. Same problem I described, but this time the corrupt extent is
> >>>>> in a different stripe of the same full stripe.
> >>>>
> >>>> Yep, I also recognized that problem after some time.
> >>>>
> >>>> Another possible solution is to always write zeroes to pinned extents
> >>>> (I'm only thinking about metadata yet).
> >>>>
> >>>> This means that, at transaction commit time, we also need to write
> >>>> zeroes to pinned extents before the next transaction starts.
> >>>
> >>> I'm assuming you mean to write zeroes to pinned extents when unpinning
> >>> them - after writing the superblock of the committing transaction.
> >>
> >> So the timing of unpinning them would also need to be changed.
> >>
> >> As mentioned, it needs to happen before starting the next transaction.
> >
> > That would be terrible for performance.
> > Any operation that needs to start a transaction would block until the
> > current transaction finishes its commit, which means waiting for a lot
> > of IO.
> > That would bring terrible stalls to applications - most operations need to
> > start/join a transaction - create file, symlink, link, unlink, rename,
> > clone/dedupe, mkdir, rmdir, mknod, truncate, create/delete snapshot,
> > etc etc etc.
>
> Exactly.
>
> >
> >>
> >> Anyway, my point is, if we ensure all unwritten data contains a certain
> >> pattern (for metadata), then we can at least use it to detect an out of
> >> sync full stripe.
> >
> > That would also cause extra IO to be done if the device doesn't
> > support trim/discard, as we would have to write the zeroes.
>
> And even for a device that supports trim, there is no guarantee that a
> read of a trimmed range returns zeroes.
> So the zero write is unavoidable.
>
> >
> > Sadly, it wouldn't solve the data extents case.
>
> And the zero write is in fact a compromise to make raid56 do
> verification work without looking up the extent tree.
>
> It trades tons of IO just to get a boolean answer to whether there are
> any extents in the partial stripes of the full stripe.
>
> So this crazy idea leans more towards doing an extent tree lookup in
> the raid56 layer.
> (And now I think the extent tree lookup is less crazy.)

Looking into the extent tree, or any other tree, even using commit
roots, has its disadvantages too; see the reply I just sent to Zygo.
It also requires preventing transaction commits from happening while
we are doing the checks.


>
> Thanks,
> Qu
>
> >
> > Thanks.
> >
> >>
> >> Thanks,
> >> Qu
> >>
> >>> But way before that, the next transaction may have already started, and
> >>> metadata and data writes may have already started as well - think of
> >>> fsync() or writepages() being called by the VM due to memory pressure,
> >>> for example (or page migration, etc.).
> >>>
> >>>> This needs some extra work, and will redo a lot of parity
> >>>> recalculation, which would definitely affect performance.
> >>>>
> >>>> So for a new partial write, before we write the new stripe, we read the
> >>>> remaining data stripes (which we already need) and the parity stripe
> >>>> (the new thing).
> >>>>
> >>>> We do the rebuild; if the rebuild result is all zero, then it means the
> >>>> full stripe is OK, and we do a regular write.
> >>>>
> >>>> If the rebuild result is not all zero, it means the full stripe is not
> >>>> consistent, so we do some repair before writing the partial stripe.
> >>>>
> >>>> However this only works for metadata, and since metadata is never all
> >>>> zeroes, an all-zero page can be an indicator.
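> >>>>
> >>>> (A rough user-space sketch of that check, assuming 2 data stripes plus
> >>>> parity and that freed space really reads back as zeroes - the function
> >>>> name is made up: before writing pages of one data stripe, rebuild them
> >>>> from the other data stripe and the parity; under the zero-fill scheme
> >>>> those pages were free, hence zero, so a non-zero rebuild flags an out
> >>>> of sync full stripe:)
> >>>>
> >>>> #include <stdbool.h>
> >>>> #include <stddef.h>
> >>>> #include <stdint.h>
> >>>>
> >>>> static bool target_rebuilds_to_zero(const uint8_t *other_data,
> >>>>                                     const uint8_t *parity, size_t len)
> >>>> {
> >>>>         for (size_t i = 0; i < len; i++)
> >>>>                 if ((other_data[i] ^ parity[i]) != 0)
> >>>>                         return false;  /* full stripe out of sync */
> >>>>         return true;
> >>>> }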
> >>>>
> >>>> How this crazy idea looks?
> >>>
> >>> Extremely crazy :/
> >>> I don't see how it would work due to the above comment.
> >>>
> >>> thanks
> >>>
> >>>>
> >>>> Thanks,
> >>>> Qu
> >>>>
> >>>>>
> >>>>> Thanks.
> >>>>>
> >>>>>>
> >>>>>> It would make metadata scrubbing suffer, and would cause performance
> >>>>>> problems I guess, but it looks a little more feasible.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Qu
> >>>>>>
> >>>>>>>
> >>>>>>> Thanks.
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >
> >
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”



