On Fri, Apr 3, 2020 at 1:00 AM Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote: > > > > On 2020/4/2 下午9:26, Filipe Manana wrote: > > On Thu, Apr 2, 2020 at 1:43 PM Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote: > >> > >> > >> > >> On 2020/4/2 下午8:33, Filipe Manana wrote: > >>> On Thu, Apr 2, 2020 at 12:55 PM Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote: > >>>> > >>>> > >>>> > >>>> On 2020/4/2 下午7:08, Filipe Manana wrote: > >>>>> Hi, > >>>>> > >>>>> Recently I was looking at why the test case btrfs/125 from fstests often fails. > >>>>> Typically when it fails we have something like the following in dmesg/syslog: > >>>>> > >>>>> (...) > >>>>> BTRFS error (device sdc): space cache generation (7) does not match inode (9) > >>>>> BTRFS warning (device sdc): failed to load free space cache for block > >>>>> group 38797312, rebuilding it now > >>>>> BTRFS info (device sdc): balance: start -d -m -s > >>>>> BTRFS info (device sdc): relocating block group 754581504 flags data|raid5 > >>>>> BTRFS error (device sdc): bad tree block start, want 39059456 have 0 > >>>>> BTRFS info (device sdc): read error corrected: ino 0 off 39059456 > >>>>> (dev /dev/sde sector 18688) > >>>>> BTRFS info (device sdc): read error corrected: ino 0 off 39063552 > >>>>> (dev /dev/sde sector 18696) > >>>>> BTRFS info (device sdc): read error corrected: ino 0 off 39067648 > >>>>> (dev /dev/sde sector 18704) > >>>>> BTRFS info (device sdc): read error corrected: ino 0 off 39071744 > >>>>> (dev /dev/sde sector 18712) > >>>>> BTRFS warning (device sdc): csum failed root -9 ino 257 off 1376256 > >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1 > >>>>> BTRFS warning (device sdc): csum failed root -9 ino 257 off 1380352 > >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1 > >>>>> BTRFS warning (device sdc): csum failed root -9 ino 257 off 1445888 > >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1 > >>>>> BTRFS warning (device sdc): csum failed root -9 ino 257 off 1384448 > >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1 > >>>>> BTRFS warning (device sdc): csum failed root -9 ino 257 off 1388544 > >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1 > >>>>> BTRFS warning (device sdc): csum failed root -9 ino 257 off 1392640 > >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1 > >>>>> BTRFS warning (device sdc): csum failed root -9 ino 257 off 1396736 > >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1 > >>>>> BTRFS warning (device sdc): csum failed root -9 ino 257 off 1400832 > >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1 > >>>>> BTRFS warning (device sdc): csum failed root -9 ino 257 off 1404928 > >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1 > >>>>> BTRFS warning (device sdc): csum failed root -9 ino 257 off 1409024 > >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1 > >>>>> BTRFS info (device sdc): read error corrected: ino 257 off 1380352 > >>>>> (dev /dev/sde sector 718728) > >>>>> BTRFS info (device sdc): read error corrected: ino 257 off 1376256 > >>>>> (dev /dev/sde sector 718720) > >>>>> BTRFS error (device sdc): bad tree block start, want 39043072 have 0 > >>>>> BTRFS error (device sdc): bad tree block start, want 39043072 have 0 > >>>>> BTRFS error (device sdc): bad tree block start, want 39043072 have 0 > >>>>> BTRFS error (device sdc): bad tree block start, want 39043072 have 0 > >>>>> BTRFS error (device sdc): bad tree block start, want 39043072 have 0 > >>>>> BTRFS error (device sdc): bad tree block start, want 39043072 have 0 > >>>>> BTRFS error (device sdc): bad 
tree block start, want 39043072 have 0 > >>>>> BTRFS error (device sdc): bad tree block start, want 39043072 have 0 > >>>>> BTRFS info (device sdc): balance: ended with status: -5 > >>>>> (...) > >>>>> > >>>>> So I finally looked into it to figure out why that happens. > >>>>> > >>>>> Consider the following scenario and steps that explain how we end up > >>>>> with a metadata extent > >>>>> permanently corrupt and unrecoverable (when it shouldn't be possible). > >>>>> > >>>>> * We have a RAID5 filesystem consisting of three devices, with device > >>>>> IDs of 1, 2 and 3; > >>>>> > >>>>> * The filesystem's nodesize is 16Kb (the default of mkfs.btrfs); > >>>>> > >>>>> * We have a single metadata block group that starts at logical offset > >>>>> 38797312 and has a > >>>>> length of 715784192 bytes. > >>>>> > >>>>> The following steps lead to a permanent corruption of a metadata extent: > >>>>> > >>>>> 1) We make device 3 unavailable and mount the filesystem in degraded > >>>>> mode, so only > >>>>> devices 1 and 2 are online; > >>>>> > >>>>> 2) We allocate a new extent buffer with a logical address of 39043072, which falls > >>>>> within the full stripe that starts at logical address 38928384, which is > >>>>> composed of 3 stripes, each with a size of 64Kb: > >>>>> > >>>>> [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ] [ > >>>>> stripe 3, offset 39059456 ] > >>>>> (the offsets are logical addresses) > >>>>> > >>>>> stripe 1 is in device 2 > >>>>> stripe 2 is in device 3 > >>>>> stripe 3 is in device 1 (this is the parity stripe) > >>>>> > >>>>> Our extent buffer 39043072 falls into stripe 2, starting at page > >>>>> with index 12 > >>>>> of that stripe and ending at page with index 15; > >>>>> > >>>>> 3) When writing the new extent buffer at address 39043072 we obviously > >>>>> don't write > >>>>> the second stripe since device 3 is missing and we are in degraded > >>>>> mode. We write > >>>>> only the stripes for devices 1 and 2, which are enough to recover > >>>>> stripe 2 content > >>>>> when we need to read it (by XORing stripes 1 and 3, we produce > >>>>> the correct > >>>>> content of stripe 2); > >>>>> > >>>>> 4) We unmount the filesystem; > >>>>> > >>>>> 5) We make device 3 available and then mount the filesystem in > >>>>> non-degraded mode; > >>>>> > >>>>> 6) Due to some write operation (such as relocation like btrfs/125 > >>>>> does), we allocate > >>>>> a new extent buffer at logical address 38993920. This belongs to > >>>>> the same full > >>>>> stripe as the extent buffer we allocated before in degraded mode (39043072), > >>>>> and it's mapped to stripe 2 of that full stripe as well, > >>>>> corresponding to page > >>>>> indexes from 0 to 3 of that stripe; > >>>>> > >>>>> 7) When we do the actual write of this stripe, because it's a partial > >>>>> stripe write > >>>>> (we aren't writing to all the pages of all the stripes of the full > >>>>> stripe), we > >>>>> need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and > >>>>> all the pages of stripe 1 from disk in order to compute the content for the > >>>>> parity stripe. So we submit bios to read those pages from the corresponding > >>>>> devices (we do this at raid56.c:raid56_rmw_stripe()). 
The problem is that we > >>>>> assume whatever we read from the devices is valid - in this case what we read > >>>>> from device 3, to which stripe 2 is mapped, is invalid since in the degraded > >>>>> mount we haven't written extent buffer 39043072 to it - so we get > >>>>> garbage from > >>>>> that device (either a stale extent, a bunch of zeroes due to trim/discard or > >>>>> anything completely random). Then we compute the content for the > >>>>> parity stripe > >>>>> based on that invalid content we read from device 3 and write the > >>>>> parity stripe > >>>>> (and the other two stripes) to disk; > >>>>> > >>>>> 8) We later try to read extent buffer 39043072 (the one we allocated while in > >>>>> degraded mode), but what we get from device 3 is invalid (this extent buffer > >>>>> belongs to a stripe of device 3, remember step 2), so > >>>>> btree_read_extent_buffer_pages() > >>>>> triggers a recovery attempt - this happens through: > >>>>> > >>>>> btree_read_extent_buffer_pages() -> read_extent_buffer_pages() -> > >>>>> -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() -> > >>>>> -> raid56_parity_recover() > >>>>> > >>>>> This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the parity > >>>>> stripe) by XORing the content of these last two. However, the parity > >>>>> stripe was > >>>>> recomputed at step 7 using invalid content from device 3 for stripe 2, so the > >>>>> rebuilt stripe 2 still has invalid content for the extent buffer 39043072. > >>>>> > >>>>> This makes it impossible to recover the extent buffer, > >>>>> resulting in permanent > >>>>> metadata corruption. If the read of the extent buffer 39043072 > >>>>> happened before the > >>>>> write of extent buffer 38993920, we would have been able to recover it since the > >>>>> parity stripe still reflected correct content, as it matched what was written in degraded > >>>>> mode at steps 2 and 3. > >>>>> > >>>>> The same type of issue happens for data extents as well. > >>>>> > >>>>> Since the stripe size is currently fixed at 64Kb, the only case where the issue doesn't happen > >>>>> is when the node size and sector size are both 64Kb (systems with a 64Kb page size). > >>>>> > >>>>> And we don't need to do writes in degraded mode and then mount in non-degraded > >>>>> mode with the previously missing device for this to happen (I gave the example > >>>>> of degraded mode because that's what btrfs/125 exercises). > >>>> > >>>> This also means other raid5/6 implementations are affected by the > >>>> same problem, right? > >>> > >>> If so, that makes them less useful as well. > >>> For all the other raid modes we support, which use mirrors, we don't > >>> have this problem. If one copy is corrupt, we are able to recover it, > >>> period. > >>> > >>>> > >>>>> > >>>>> Any scenario where the on-disk content for an extent changed (some bit flips, for > >>>>> example) can result in a permanently unrecoverable metadata or data extent if we > >>>>> have the bad luck of having a partial stripe write happen before an attempt to > >>>>> read and recover a corrupt extent in the same stripe. > >>>>> > >>>>> Zygo had a report some months ago where he experienced this as well: > >>>>> > >>>>> https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@xxxxxxxxxxxxxx/ > >>>>> > >>>>> I haven't tried his script to reproduce it, but it's very likely due to this > >>>>> issue, caused by partial stripe writes happening before reads and recovery attempts. 
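To make the failure mode above concrete, here is a tiny standalone C sketch of steps 2 to 8, with the three 64Kb stripes shrunk to 16 bytes. The buffer names and fill values are invented purely for illustration - this is not btrfs code, only the XOR arithmetic that the partial-write (RMW) and recovery paths perform:

#include <stdio.h>
#include <string.h>

#define STRIPE_LEN 16	/* stands in for the 64Kb device stripe */

static void xor_into(unsigned char *dst, const unsigned char *src)
{
	int i;

	for (i = 0; i < STRIPE_LEN; i++)
		dst[i] ^= src[i];
}

int main(void)
{
	unsigned char stripe1[STRIPE_LEN];	/* on device 2 */
	unsigned char disk3[STRIPE_LEN];	/* what device 3 physically holds */
	unsigned char wanted[STRIPE_LEN];	/* stripe 2 content we expect back */
	unsigned char parity[STRIPE_LEN];	/* on device 1 */
	unsigned char rebuilt[STRIPE_LEN];

	/* Steps 2-3: degraded write. Stripe 2 (extent buffer 39043072) is never
	   written to device 3; it only exists implicitly as stripe 1 ^ parity. */
	memset(stripe1, 0x11, STRIPE_LEN);
	memset(wanted, 0x22, STRIPE_LEN);
	memset(disk3, 0xee, STRIPE_LEN);	/* stale garbage left on device 3 */
	memcpy(parity, stripe1, STRIPE_LEN);
	xor_into(parity, wanted);		/* parity = stripe 1 ^ stripe 2 */

	/* Steps 6-7: partial stripe write of extent buffer 38993920. The RMW
	   path reads the untouched pages of stripe 2 from device 3, trusts
	   them, and recomputes the parity stripe from them. */
	memset(disk3, 0x33, STRIPE_LEN / 4);	/* only "pages 0 to 3" are new */
	memcpy(parity, stripe1, STRIPE_LEN);
	xor_into(parity, disk3);		/* parity is now poisoned */

	/* Step 8: reading 39043072 fails validation, so we rebuild stripe 2 as
	   stripe 1 ^ parity - which just hands back the stale bytes again. */
	memcpy(rebuilt, stripe1, STRIPE_LEN);
	xor_into(rebuilt, parity);

	printf("recovered the degraded-mode extent buffer: %s\n",
	       memcmp(rebuilt + STRIPE_LEN / 4, wanted + STRIPE_LEN / 4,
		      STRIPE_LEN - STRIPE_LEN / 4) == 0 ? "yes" : "no");
	return 0;
}

It prints "no": once the parity stripe has been recomputed from the stale contents of device 3, XORing stripe 1 with the parity stripe can only reproduce those stale contents again, never the extent buffer that was written in degraded mode.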
> >>>>> > >>>>> This is a problem that has been around since raid5/6 support was added, and it > >>>>> seems to me it's something that was not thought about in the initial design. > >>>>> > >>>>> The validation/check of an extent (both metadata and data) happens at a higher > >>>>> layer than the raid5/6 layer, and it's the higher layer that orders the lower > >>>>> layer (raid56.{c,h}) to attempt recover/repair after it reads an extent that > >>>>> fails validation. > >>>>> > >>>>> I'm not seeing a reasonable way to fix this at the moment, initial thoughts all > >>>>> imply: > >>>>> > >>>>> 1) Attempts to validate all extents of a stripe before doing a partial write, > >>>>> which not only would be a performance killer and terribly complex, but would > >>>>> also be very messy to organize with respect to proper layering of > >>>>> responsibilities; > >>>> > >>>> Yes, this means the raid56 layer will rely on the extent tree to do > >>>> verification, which is too complex. > >>>> > >>>> Not really worth it to me either. > >>>> > >>>>> > >>>>> 2) Maybe changing the allocator to work in a special way for raid5/6 such that > >>>>> it never allocates an extent from a stripe that already has extents that were > >>>>> allocated by past transactions. However data extent allocation is currently > >>>>> done without holding a transaction open (and for good reasons) during > >>>>> writeback. Would need more thought to see how viable it is, but not trivial > >>>>> either. > >>>>> > >>>>> Any thoughts? Perhaps someone else was already aware of this problem and > >>>>> had thought about this before. Josef? > >>>> > >>>> What about using sector size as device stripe size? > >>> > >>> Unfortunately that wouldn't work either. > >>> > >>> Say you have stripe 1 with a corrupt metadata extent. Then you do a > >>> write to a metadata extent located at stripe 2 - this partial write > >>> (because it doesn't cover all stripes of the full stripe) will read > >>> the pages from the first stripe and assume they are all good, and then > >>> use those for computing the parity stripe - based on a corrupt extent > >>> as well. Same problem I described, but this time the corrupt extent is > >>> in a different stripe of the same full stripe. > >> > >> Yep, I also recognized that problem after some time. > >> > >> Another possible solution is to always write 0 bytes for pinned extents > >> (I'm only thinking about metadata yet). > >> > >> This means, at transaction commit time, we also need to write 0 for > >> pinned extents before the next transaction starts. > > > > I'm assuming you mean to write zeroes to pinned extents when unpinning > > them - after writing the superblock of the committing transaction. > > So the timing of unpinning them would also need to be changed. > > As mentioned, it needs to be before starting the next transaction. That would be terrible for performance. Any operation that needs to start a transaction would block until the current transaction finishes its commit, which means waiting for a lot of IO. It would bring terrible stalls to applications - most operations need to start/join a transaction: create file, symlink, link, unlink, rename, clone/dedupe, mkdir, rmdir, mknod, truncate, create/delete snapshot, etc. > > Anyway, my point is, if we ensure all unwritten data contains a certain > pattern (for metadata), then we can at least use it to detect out of > sync full stripes. That would also cause extra IO to be done if the device doesn't support trim/discard, as we would have to write the zeroes. 
Sadly, it wouldn't solve the data extents case. Thanks. > > Thanks, > Qu > > > But > > way before that, the next transaction may have already started, and > > metadata and data writes may have already started as well - think of > > fsync() or writepages() being called by the vm due to memory pressure, > > for example (or page migration, etc). > > > >> This needs some extra work, and will definitely redo a lot of parity > >> recalculation, which would definitely affect performance. > >> > >> So for a new partial write, before we write the new stripe, we read the > >> remaining data stripes (which we already need) and the parity stripe > >> (the new thing). > >> > >> We do the rebuild; if the rebuild result is all zero, then it means the > >> full stripe is OK, and we do a regular write. > >> > >> If the rebuild result is not all zero, it means the full stripe is not > >> consistent, so do some repair before writing the partial stripe. > >> > >> However this only works for metadata, and metadata is never all 0, so > >> an all-zero page can be an indicator. > >> > >> How does this crazy idea look? > > > > Extremely crazy :/ > > I don't see how it would work due to the above comment. > > > > thanks > > > >> > >> Thanks, > >> Qu > >> > >>> > >>> Thanks. > >>> > >>>> > >>>> It would make metadata scrubbing suffer, and would cause performance > >>>> problems I guess, but it looks a little more feasible. > >>>> > >>>> Thanks, > >>>> Qu > >>>> > >>>>> > >>>>> Thanks. > >>>>> > >>>>> > >>>> > >>> > >>> > >> > > > > > -- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.”
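For reference, here is a rough user-space sketch of the kind of full-stripe consistency check discussed above - XOR every data stripe into the parity stripe and treat a non-zero result as an out-of-sync full stripe - again shrunk to 16-byte stripes. The function name and layout are hypothetical, this is not existing btrfs code, and it leaves open what repair to do once an inconsistency is found:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define STRIPE_LEN 16	/* again shrunk from the real 64Kb stripe */

/* Hypothetical helper: true if XORing every data stripe into the parity
   stripe leaves only zeroes, i.e. the full stripe is internally consistent.
   A non-zero result would mean some stripe is stale (for example never
   written, as in the degraded-mode scenario) and the on-disk contents must
   not be trusted by a partial (read-modify-write) stripe write. */
static bool full_stripe_consistent(const unsigned char *data[], size_t nr_data,
				   const unsigned char *parity)
{
	unsigned char acc[STRIPE_LEN];
	size_t i, j;

	for (i = 0; i < STRIPE_LEN; i++)
		acc[i] = parity[i];
	for (j = 0; j < nr_data; j++)
		for (i = 0; i < STRIPE_LEN; i++)
			acc[i] ^= data[j][i];
	for (i = 0; i < STRIPE_LEN; i++)
		if (acc[i] != 0)
			return false;
	return true;
}

int main(void)
{
	unsigned char d1[STRIPE_LEN] = { 0 }, d2[STRIPE_LEN] = { 0 };
	unsigned char p[STRIPE_LEN] = { 0 };
	const unsigned char *data[] = { d1, d2 };

	d1[0] = 0x11;
	d2[0] = 0x22;
	p[0] = 0x11 ^ 0x22;			/* matching parity */
	printf("consistent: %d\n", full_stripe_consistent(data, 2, p));

	p[0] = 0;				/* simulate a stale stripe */
	printf("consistent: %d\n", full_stripe_consistent(data, 2, p));
	return 0;
}

As the thread notes, a check like this can only tell us that the full stripe is out of sync; without something like a guaranteed fill pattern for never-written ranges (and an answer for data extents), it cannot tell which stripe holds the stale contents.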
