I agree with this. btrfs device stats /mnt should show the number of
filesystem errors as well. In short, I think regular users (like me for
example) would want an output that shows how many fixes have been
performed. E.g. each time BTRFS' repair logic kicks in and either logs a
file as corrupt or corrects a checksum failure with a second copy, it
should be logged.
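For reference, today the command only prints the per-device error
counters, something like this (device name and values made up):

  # btrfs device stats /mnt
  [/dev/sdb1].write_io_errs    0
  [/dev/sdb1].read_io_errs     0
  [/dev/sdb1].flush_io_errs    0
  [/dev/sdb1].corruption_errs  0
  [/dev/sdb1].generation_errs  0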
Speaking of... it would be great if btrfs device stats /mnt (or some
other command) could offer to show a list of corrupted files with paths
(which I assume the filesystem knows about). That would make restoring
files from backup a breeze!
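In the meantime, the closest thing seems to be fishing those paths (or
at least inode numbers) out of the kernel log. A rough sketch, assuming
the usual message wording, which varies between kernel versions and only
sometimes includes a "path:" hint:

  # Corruption/correction messages from the kernel log (wording varies):
  journalctl -k | grep -E 'BTRFS.*(csum failed|read error corrected|checksum error)'

  # Turning an inode number back into a path (257 is just an example):
  btrfs inspect-internal inode-resolve 257 /mnt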
-Waxhead
Philip Seeger wrote:
On 1/11/20 8:42 AM, Andrei Borzenkov wrote:
> On one mirror piece. It likely got correct data from another piece.
This may be a dumb question, but what mirror are we talking about here?
Note that this "csum failed" message happened on a different system
(than the "read error corrected" I quoted in the first message of this
thread). It happened on a simple btrfs partition where, by default,
"data" isn't mirrored (fi df: single).
I wrote it in the same thread not to cause confusion, but simply
because I saw the same problem I'm trying to describe in another
configuration: "dev stats" claims no errors occurred, so monitoring
tools relying on "dev stats" won't be able to notify admins about a bad
drive in time.
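To make concrete what such a monitoring job boils down to, here is a
minimal sketch. It assumes a btrfs-progs version where "device stats
--check" exits non-zero when any counter is non-zero; the mail address
is a placeholder:

  #!/bin/sh
  # Cron-able check: alert if any btrfs device error counter is non-zero.
  # Assumes btrfs-progs with the -c/--check option on "device stats".
  MNT=/mnt
  if ! STATS=$(btrfs device stats --check "$MNT"); then
      printf '%s\n' "$STATS" | mail -s "btrfs errors on $MNT" admin@example.com
  fi

Of course this only helps if the counters actually get incremented,
which is exactly the problem here.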
I think this is a serious bug. Just think about what that means. A drive
goes bad and btrfs keeps fixing errors caused by that drive, but because
"dev stats" keeps saying "all good" (which is what a monitoring job
checks hourly or daily), no admin will be notified. A few weeks later, a
second drive fails, causing data loss. The whole RAID array might be
gone, and even if the backups are relatively up to date, the work from
the past few hours will be lost. Then there's the outage and the
downtime: everything has to be shut down and restored from the backup,
which might take a lot of time if the backup system is slow, maybe some
parts are stored on tapes...
In other words: what's the point of a RAID system if the admin has no
way of knowing that a drive is bad and has to be replaced?
> This is not a device-level error. btrfs got the data from the block
> device without error. The fact that the content was wrong does not
> necessarily mean a block device problem.
I understand that it's unlikely that it was a block device problem;
after all, it's a new hard drive (and I ran badblocks on it, which
didn't find any errors).
But if the drive is good, the file appears to be correct, and one
checksum (of two?) matched the file's contents, why was the other
checksum wrong? Or could it be something completely different that
triggers the same error message?
> You have a mirror and btrfs got correct data from another device
> (otherwise you would not have been able to read the file at all). Of
> course you should be worried about why one copy of the data was not
> correct.
Which other device?
By "one copy of data", do you mean one of the two checksums (which are
stored twice in case one copy gets corrupted, as apparently happened
here)?
> Again - there was no error *reading* data from the block device. Is
> corruption_errs also zero?
Yes, all the error counts were zero.