Re: Monitoring not working as "dev stats" returns 0 after read error occurred

On 2020-01-09 13:04, Nikolay Borisov wrote:
> According to the log provided, the error returned from the NVMe device is
> BLK_STS_MEDIUM/-ENODATA, hence the "critical medium" string there. Btrfs'
> code OTOH only logs an error in case it gets STS_IOERR or STS_TARGET
> from the block layer. It seems there are other error codes which are
> also ignored but can signify errors, e.g. STS_NEXUS/STS_TRANSPORT.

Thanks for looking into this!
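
If I follow, the check being described boils down to something like the
following. This is a simplified paraphrase of the bio completion path in
fs/btrfs/volumes.c, not the literal kernel code, and the exact shape varies
by kernel version:

	/* Only these two block-layer status codes bump the device stats. */
	if (bio->bi_status) {
		if (bio->bi_status == BLK_STS_IOERR ||
		    bio->bi_status == BLK_STS_TARGET) {
			if (bio_op(bio) == REQ_OP_WRITE)
				btrfs_dev_stat_inc_and_print(dev,
						BTRFS_DEV_STAT_WRITE_ERRS);
			else
				btrfs_dev_stat_inc_and_print(dev,
						BTRFS_DEV_STAT_READ_ERRS);
		}
		/* A BLK_STS_MEDIUM completion (the "critical medium error"
		 * case in the log above) falls through without touching any
		 * of the counters that "btrfs device stats" reports. */
	}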

> So as it stands this is expected, but I'm not sure it's correct behavior;
> perhaps we need to extend the range of conditions we record as errors.

I don't understand how this could possibly be correct behavior.
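
For what it's worth, even a small widening of that condition along the lines
you suggest would have made this failure show up in the counters. A rough,
untested sketch of the same spot in the completion path, using the status
names from include/linux/blk_types.h:

	switch (bio->bi_status) {
	case BLK_STS_IOERR:
	case BLK_STS_TARGET:
	case BLK_STS_MEDIUM:	/* the "critical medium error" hit here */
	case BLK_STS_NEXUS:
	case BLK_STS_TRANSPORT:
		if (bio_op(bio) == REQ_OP_WRITE)
			btrfs_dev_stat_inc_and_print(dev,
					BTRFS_DEV_STAT_WRITE_ERRS);
		else
			btrfs_dev_stat_inc_and_print(dev,
					BTRFS_DEV_STAT_READ_ERRS);
		break;
	default:
		break;
	}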

The "device stats" command returns the error counters for a BTRFS filesystem, just like "zfs status" returns the error counters for a ZFS filesystem. So that's the one and only command that can be used by the monitoring job that checks the health of the system. If the error counters all stay at zero after device errors have occurred and that's deemed correct behavior, how would the monitoring system be able to notify the admin about a bad drive that should be replace before another one goes bad, causing data loss?


