Re: Monitoring not working as "dev stats" returns 0 after read error occurred

I agree with this. btrfs device stats /mnt should show the number of filesystem errors as well. In short, I think regular users (like me, for example) would want output that shows how many fixes have been performed. E.g. each time BTRFS' repair logic kicks in and either flags a file as corrupt or corrects a checksum failure using a second copy, it should be logged.
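
For reference, this is roughly what the per-device counters look like today (the device name and values here are just illustrative):

  # btrfs device stats /mnt
  [/dev/sdb1].write_io_errs    0
  [/dev/sdb1].read_io_errs     0
  [/dev/sdb1].flush_io_errs    0
  [/dev/sdb1].corruption_errs  0
  [/dev/sdb1].generation_errs  0

The complaint in this thread is that these counters can all stay at zero even after the kernel has logged (and repaired) checksum failures.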

Speaking of... it would be great if btrfs device stats /mnt (or some other command) could show a list of corrupted files with their paths (which I assume the filesystem knows about). That would make restoring files from backup a breeze!
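
As far as I know, the closest thing today is to dig the inode numbers out of the kernel's csum messages and resolve them by hand, something along these lines (the log wording varies between kernel versions, and the inode number is just an example):

  # look for checksum complaints in the kernel log
  dmesg | grep 'csum failed'
  #   ... csum failed root 5 ino 257 off 1261568 ... mirror 1
  # resolve the inode number from the message back to a path
  btrfs inspect-internal inode-resolve 257 /mnt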

-Waxhead

Philip Seeger wrote:
On 1/11/20 8:42 AM, Andrei Borzenkov wrote:

On one mirror piece. It likely got correct data from another piece.

This may be a dumb question, but what mirror are we talking about here? Note that this "csum failed" message happened on a different system (than the "read error corrected" I quoted in the first message of this thread). This happened on a simple btrfs partition where, by default, "data" isn't mirrored (fi df: single). I brought it up in the same thread not to cause confusion, but simply because I saw the same problem I'm trying to describe in another configuration: "dev stats" claims no errors occurred, so monitoring tools relying on "dev stats" won't be able to notify admins about a bad drive before it's too late.

I think this is a serious bug. Just think about what that means. A drive goes bad and btrfs keeps fixing errors caused by that drive, but since "dev stats" keeps saying "all good" (and that's what a monitoring job checks hourly or daily), no admin will be notified. A few weeks later, a second drive fails, causing data loss. The whole RAID array might be gone, and even if the backups are relatively up to date, the work from the past few hours is lost. Then there's the outage and the downtime: everything has to be shut down and restored from backup, which might take a long time if the backup system is slow, or if some parts are stored on tapes...
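
A minimal version of such a monitoring check could be something like this (just a sketch; the mount point and the mail command are placeholders):

  # hourly cron job: alert if any btrfs dev stats counter is non-zero
  btrfs device stats /mnt | awk '$2 != 0 { bad = 1 } END { exit !bad }' && \
      echo "btrfs device stats reports errors on /mnt" | mail -s "btrfs errors" root

That only helps, of course, if the counters actually increase when corruption is detected and repaired.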

In other words: what's the point of a RAID system if the admin has no way of knowing that a drive is bad and has to be replaced?

This is not a device-level error. btrfs got the data from the block device without error. The fact that the content was wrong does not necessarily mean a block device problem.

I understand that it's unlikely that it was a block device problem; after all, it's a new hard drive (and I ran badblocks on it, which didn't find any errors). But if the drive is good, the file appears to be correct, and one (of two?) checksums matched the file's contents, why was the other checksum wrong? Or could it be something completely different that triggers the same error message?

You have a mirror and btrfs got correct data from another device (otherwise you would not have been able to read the file at all). Of course you should be worried about why one copy of the data was not correct.

Which other device?

By "one copy of data", do you mean one of the two checksums (which are stored twice in case one copy gets corrupted, as apparently happened here)?

Again - there was no error *reading* data from the block device. Is corruption_errs also zero?

Yes, all the error counts were zero.


