On Wed, Jan 08, 2020 at 06:41:36PM +0100, philip@xxxxxxxxxxxxxxxx wrote: > A bad drive caused i/o errors, but no notifications were sent out because > "btrfs dev stats" fails to increase the error counter. > > When checking dmesg, looking for something completely different, there were > some error messages indicating that this drive is causing i/o errors and may > die soon: > > BTRFS info (device sda3): read error corrected: ino 194473 off 2170880 > > But checking the stats (as generally recommended to get notified about such > errors) claims that no errors have occurred (nothing to see here, keep > moving): > > # btrfs dev stats / | grep sda3 | grep read > [/dev/sda3].read_io_errs 0 > > Why? > Isn't that the most important feature of a RAID system - to get notified > about errors, to be able to replace such a drive...? > > The fs is mirrored, so those errors didn't cause any data loss. This seems to be a result of commit 0cc068e6ee59c1fffbfa977d8bf868b7551d80ac "btrfs: don't report readahead errors and don't update statistics" which assumes that readahead errors are ignored and have no side effects. It turns out this is not true: in the event of a recoverable read or csum failure, the filesystem overwrites the bad block with good data recovered by using mirror copies or parity, in both normal reads and readahead reads. Later, when a non-readahead read comes along to read the block, there's no error on the device any more because btrfs already corrected it, so no counters are updated. This is very bad--the whole point of dev stats is to make failure events observable so failing hardware can be identified and replaced, and commit 0cc068e6ee makes some of those events not count. This commit appears in kernels 5.1-rc3 and later. A simple workaround is to revert the commit, and let printk ratelimiting deal with the dmesg spam issue. I don't know if there are other errors that readahead can generate that are not IO errors (e.g. running out of memory). If there are, then there will be some false positive read error counts. These are still better than false negative read error counts. An alternative fix would be to have readahead not correct the data, so readahead truly has no side-effects. This would make the rationale of this commit more supportable--except we'd still want to know if a device was intermittently failing to read, so we should still count the readahead errors anyway. It would be nice to have a dev stats counter for corrections too. It's a pretty significant event for monitoring if we overwrite committed data for any reason, but especially so if we are doing it in response to a detected device failure. We should definitely count those. It would also be nice to throw some data into a tree every time an error occurs (extent bytenr, offset, generation, error type, event count, device or mirror ID) every time an error occurs. A tool can read this tree to pull out the extent bytenrs, map them back to paths (if the generation matches; otherwise, it's an old error and the file has since been overwritten), and produce something like the zfs status output. But that's way beyond a bug fix. > > # uname -sr > Linux 5.2.7-100.fc29.x86_64 > # btrfs --version > btrfs-progs v5.1 > >
Attachment:
signature.asc
Description: PGP signature
