On Tue, Mar 3, 2020 at 4:40 PM Steven Fosdick <stevenfosdick@xxxxxxxxx> wrote:
>
> On Sat, 29 Feb 2020 at 06:31, Chris Murphy <lists@xxxxxxxxxxxxxxxxx> wrote:
> >
> > s/might/should
>
> I do think it is worth looking at the possibility that the "write
> hole", because it is well documented, is being blamed for all cases
> where data proves to be unrecoverable, when some of these may be due
> to a bug or bugs. From what I've found about the write hole, it comes
> from uncertainty over which of several discs actually got written to,
> so when copies don't match there is no way to know which one is
> right. In the case of a disc failure, though, surely the copy that is
> right is the one that doesn't involve the failed disc? Or is there
> something else I don't understand?

a. The write hole doesn't happen with raid1, and your metadata is
raid1, so any file system corruption is not related to the write hole.

b. The write hole can apply to your raid5 data stripes, but it is a
rare case: a crash or power failure during write leaves a stripe
incompletely rewritten while it's being modified. That's rare on
conventional raid5, and rarer still on Btrfs, but it can happen.

c. To actually be affected by the write hole problem, the stripe with
the mismatching parity strip must also have a missing data strip, such
as a bad sector in one of the strips or a failed device. If neither of
those is the case, it's not the write hole; it's something else.

d. Before there's a device or sector failure, a scrub following a
crash or power loss will correct the problem resulting from the write
hole.

> I did try running a scrub but had to abandon it as it wasn't proving
> very useful. It wasn't fixing the errors and wasn't providing any
> messages that would help diagnose or fix them some other way - it
> only seems to provide a count of the errors it didn't fix.

It can't fix them when the file system is mounted read only (see the
example further down).

> That seems to be a general thing, in that there seem to be plenty of
> ways an overall 'failed' status can be returned to userspace, usually
> without anything being logged. That obviously makes sense if the
> request was to do something stupid, but if instead the error return
> is because corruption has been found, would it not be better to log
> an error?

The most obvious case of corruption is a checksum mismatch (the
on-the-fly checksum for a node/leaf/block compared to the recorded
checksum). Btrfs always reports this.

Parity strips are not checksummed. If parity is corrupt, it's only
corrected on a scrub (or balance); parity isn't used during normal
read operations. On degraded reads, parity is used to reconstruct
data. Since there's no checksum, the parity is trusted, and bad parity
will cause a corrupt reconstruction of data; that corruption fails the
data checksum, so Btrfs will tell you about it and also return EIO.

So that leaves the less obvious cases of corruption, where some
metadata or data is corrupted in memory, a valid checksum is computed
on the already corrupt data/metadata, and that is then written to
disk. When Btrfs later reads it there's no checksum mismatch, and yet
there is corruption. For metadata reads, the tree checker has gotten
quite a bit better lately at sanity checking metadata. For data you're
out of luck: the application will have to sanity check it, and if it
doesn't, the data is just corrupt - but that's no different from any
other file system. At least Btrfs gave you a chance. That's the gotcha
of bad RAM or other sources of bit flips in the storage stack.
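
For what it's worth, once the file system can be mounted read-write
again, a scrub run roughly like this is what actually repairs csum
mismatches and rewrites bad parity (the mount point is a placeholder;
-B keeps scrub in the foreground, -d prints per-device stats):

    # remount read-write first, assuming the fs will accept it
    mount -o remount,rw /mnt/array

    # scrub in the foreground, with per-device statistics
    btrfs scrub start -Bd /mnt/array

    # counters for a scrub that was started in the background
    btrfs scrub status /mnt/array

    # per-device error counters (write/read/flush/corruption/generation)
    btrfs device stats /mnt/array

'btrfs device stats' is also useful afterwards, since those are the
counters that get bumped when mismatches are found.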
> > That looks like a bug. I'd try a newer btrfs-progs version. Kernel 5.1
> > is EOL but I don't think that's related to the usage info. Still, tons
> > of btrfs bugs fixed between 5.1 and 5.5...
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/fs/btrfs/?id=v5.5&id2=v5.1
> >
> > Including raid56 specific fixes:
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/fs/btrfs/raid56.c?id=v5.5&id2=v5.1
>
> This was in response to posting dodgy output from btrfs fi usage. My
> output was from btrfs-progs v5.4 which, when I checked yesterday,
> seemed to be the latest. I am also running Linux 5.5.7. It may have
> been slightly older when the disk failed but would still have been
> 5.5.x

From six days ago, your dmesg:

    Sep 27 15:16:08 meije kernel: Not tainted 5.1.10-arch1-1-ARCH #1

Actually, what I should have asked is whether you ever ran a 5.2
through 5.2.14 kernel, because that series had a known corruption bug
in it, fixed in 5.2.15.

> Since my previous e-mail I have managed to get a 'btrfs device remove
> missing' to work by reading all the files from userspace, deleting
> those that returned I/O errors and restoring from backup. Even after
> that the summary information is still wacky:
>
> WARNING: RAID56 detected, not implemented
> Overall:
>     Device size:                  16.37TiB
>     Device allocated:             30.06GiB
>     Device unallocated:           16.34TiB
>     Device missing:                  0.00B
>     Used:                         25.40GiB
>     Free (estimated):                0.00B  (min: 8.00EiB)
>     Data ratio:                       0.00
>     Metadata ratio:                   2.00
>     Global reserve:              512.00MiB  (used: 0.00B)
>
> is the clue in the warning message? It looks like it is failing to
> count any of the RAID5 blocks.

I think it's just saying that 'btrfs filesystem usage' doesn't
completely support raid56. 'btrfs fi df' and 'btrfs fi show' should
show things correctly.

> Point taken about device replace. What would device replace do if the
> remove step failed in the same way that device remove has been
> failing for me recently?

I don't understand the question. The device replace command combines
'device add' and 'device remove' in one step; it just lacks the
implied resize that happens with add and remove.

> I'm a little disappointed we didn't get to the bottom of the bug that
> was causing the free space cache to become corrupted when a balance
> operation failed, but when I asked what I could do to help I got no
> reply to that part of my message (not just from you, from anyone on
> the list).

The free space cache isn't that important; it's an optimization that
can be discarded and reconstructed. I don't think it's checksummed
anyway; instead, corruption is detected by a mismatching
generation/transid, I think. So it may not literally be corrupt, it's
just ambiguous whether it can be relied upon, and therefore it's
marked for reconstruction.
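
If you want to force that reconstruction yourself, something along
these lines should do it (device and mount point are placeholders, and
this assumes the default v1 space cache):

    # rebuild the space cache once at the next mount
    mount -o clear_cache /dev/sda1 /mnt/array

    # or clear it offline with btrfs-progs while the fs is unmounted
    btrfs check --clear-space-cache v1 /dev/sda1

Either way it just gets regenerated as block groups are used again.

-- 
Chris Murphy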
