On 2020-01-09 13:04, Nikolay Borisov wrote:
> According to the log provided, the error returned from the NVMe device
> is BLK_STS_MEDIUM/-ENODATA, hence the "critical medium" string there.
> Btrfs' code OTOH only logs an error in case it gets STS_IOERR or
> STS_TARGET from the block layer. It seems there are other error codes
> which are also ignored but can signify errors, e.g.
> STS_NEXUS/STS_TRANSPORT.
Thanks for looking into this!
> So as it stands this is expected, but I'm not sure it's correct
> behavior; perhaps we need to extend the range of conditions we record
> as errors.
I don't understand how this could possibly be correct behavior.
The "device stats" command returns the error counters for a BTRFS
filesystem, just like "zfs status" returns the error counters for a ZFS
filesystem. So that's the one and only command that can be used by the
monitoring job that checks the health of the system. If the error
counters all stay at zero after device errors have occurred and that's
deemed correct behavior, how would the monitoring system be able to
notify the admin about a bad drive that should be replaced before another
one goes bad, causing data loss?
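
For context, the monitoring job has nothing else to go on: as far as I
can tell, "btrfs device stats" simply reads the per-device counters via
the BTRFS_IOC_GET_DEV_STATS ioctl. A rough stand-alone sketch of that
check (devid 1 assumed for a single-device filesystem; a real tool
would enumerate devids first):

    /*
     * Roughly what "btrfs device stats" does internally: read the
     * per-device counters with the BTRFS_IOC_GET_DEV_STATS ioctl.
     * devid 1 is assumed here; error handling kept minimal.
     */
    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/btrfs.h>

    int main(int argc, char **argv)
    {
        struct btrfs_ioctl_get_dev_stats args;
        unsigned int i;
        int fd, bad = 0;

        /* open the mount point, not the block device */
        fd = open(argc > 1 ? argv[1] : "/mnt", O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        memset(&args, 0, sizeof(args));
        args.devid = 1;                            /* assumed devid */
        args.nr_items = BTRFS_DEV_STAT_VALUES_MAX;
        args.flags = 0;                            /* read, don't reset */

        if (ioctl(fd, BTRFS_IOC_GET_DEV_STATS, &args) < 0) {
            perror("BTRFS_IOC_GET_DEV_STATS");
            return 1;
        }

        for (i = 0; i < args.nr_items; i++) {
            printf("stat[%u] = %llu\n", i,
                   (unsigned long long)args.values[i]);
            if (args.values[i])
                bad = 1;
        }
        close(fd);
        return bad;    /* non-zero exit if any counter is non-zero */
    }

If the kernel never bumps those counters for a medium error, no amount
of polling on the userspace side is going to catch the failing drive.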
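
And if the fix is indeed to "extend the range of conditions we record
as errors", then something along these lines is what I'd hope for.
Untested sketch only, not a patch; the helper name is hypothetical, and
the constant names are taken from include/linux/blk_types.h. It would
be called from the spot in btrfs_end_bio() (fs/btrfs/volumes.c) that
currently checks only for BLK_STS_IOERR/BLK_STS_TARGET:

    #include <linux/blk_types.h>

    /*
     * Untested sketch, not a patch.  The idea is just to widen the
     * bi_status filter that decides whether a failed bio bumps the
     * per-device error counters, instead of counting only
     * BLK_STS_IOERR / BLK_STS_TARGET.
     */
    static inline bool btrfs_blk_status_counts_as_dev_error(blk_status_t status)
    {
        return status == BLK_STS_IOERR ||      /* counted today */
               status == BLK_STS_TARGET ||     /* counted today */
               status == BLK_STS_MEDIUM ||     /* the "critical medium" case from my log */
               status == BLK_STS_NEXUS ||
               status == BLK_STS_TRANSPORT;
    }

Whether NEXUS/TRANSPORT belong in that list is debatable, but
BLK_STS_MEDIUM certainly looks like something a per-device error
counter should notice.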