On Sat, 29 Feb 2020 at 06:31, Chris Murphy <lists@xxxxxxxxxxxxxxxxx> wrote:
> s/might/should

I do think it is worth looking at the possibility that the "write hole", because it is well documented, is being blamed for all cases where data proves to be unrecoverable, when some of these may be due to a bug or bugs.

From what I've found, the write hole comes about because of uncertainty over which of several discs actually got written to, so when copies don't match there is no way to know which one is right. In the case of a disc failure, though, surely the copy that is right is the one that doesn't involve the failed disc? Or is there something else I don't understand?

> I'm curious why you had to use force, but yes that should check all of
> them. If this is a mounted file system, there's 'btrfs scrub' for this
> purpose though too and it can be set to run read-only on a read-only
> mounted file system.

In the case of 'btrfs check' the filesystem was mounted r/o but I had things reading it so didn't want to unmount it completely. It requires --force to work on a mounted filesystem even if the mount is r/o.

I did try running a scrub but had to abandon it as it wasn't proving very useful. It wasn't fixing the errors and wasn't providing any messages that would help diagnose or fix them some other way - it only seems to provide a count of the errors it didn't fix.

That seems to be a general thing: there seem to be plenty of ways an overall 'failed' status can be returned to userspace, usually without anything being logged. That obviously makes sense if the request was to do something stupid, but if instead the error return is because corruption has been found, would it not be better to log an error?

> That looks like a bug. I'd try a newer btrfs-progs version. Kernel 5.1
> is EOL but I don't think that's related to the usage info. Still, tons
> of btrfs bugs fixed between 5.1 and 5.5...
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/fs/btrfs/?id=v5.5&id2=v5.1
>
> Including raid56 specific fixes:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/fs/btrfs/raid56.c?id=v5.5&id2=v5.1

This was in response to posting dodgy output from btrfs fi usage. My output was from btrfs-progs v5.4 which, when I checked yesterday, seemed to be the latest. I am also running Linux 5.5.7. It may have been slightly older when the disk failed but would still have been 5.5.x.

Since my previous e-mail I have managed to get a 'btrfs device remove missing' to work by reading all the files from userspace, deleting those that returned an I/O error and restoring them from backup. Even after that the summary information is still wacky:

    WARNING: RAID56 detected, not implemented
    Overall:
        Device size:                  16.37TiB
        Device allocated:             30.06GiB
        Device unallocated:           16.34TiB
        Device missing:                  0.00B
        Used:                         25.40GiB
        Free (estimated):                0.00B      (min: 8.00EiB)
        Data ratio:                       0.00
        Metadata ratio:                   2.00
        Global reserve:              512.00MiB      (used: 0.00B)

Is the clue in the warning message? It looks like it is failing to count any of the RAID5 blocks.

Point taken about device replace. What would device replace do if the remove step failed in the same way that device remove has been failing for me recently?

I'm a little disappointed we didn't get to the bottom of the bug that was causing the free space cache to become corrupted when a balance operation failed, but when I asked what I could do to help I got no reply to that part of my message (not just from you, from anyone on the list).
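P.S. For reference, the userspace read-through I mention above was roughly the following (the mount point and file names here are just examples, not my real layout):

    # list every file that fails to read back
    find /mnt/pool -type f -exec sh -c 'cat "$1" >/dev/null 2>&1 || echo "$1"' _ {} \; > unreadable.txt

    # after deleting the files listed in unreadable.txt and restoring
    # them from backup, the remove finally completed:
    btrfs device remove missing /mnt/pool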

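P.P.S. To put some toy numbers on what I said about the write hole at the top (this is just my reading of the write-hole descriptions I've found, not anything btrfs-specific): suppose a stripe holds data blocks d1 and d2 plus parity p = d1 XOR d2. If a crash lets a new d1 reach its disc but not the matching parity update, and the disc holding d2 then fails, reconstructing d2 from d1 and p gives garbage:

    # stale parity after a torn write makes reconstruction wrong
    d1=0x0f; d2=0x33; p=$(( d1 ^ d2 ))   # consistent stripe, p = 0x3c
    d1=0xaa                              # new d1 lands, crash before p is rewritten
    printf 'reconstructed d2 = 0x%02x, real d2 = 0x33\n' $(( d1 ^ p ))
    # prints 0x96, not 0x33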