Re: Is it logical to use a disk that scrub fails but smartctl succeeds?

On Wed, Dec 11, 2019 at 6:11 AM Cerem Cem ASLAN <ceremcem@xxxxxxxxxxxx> wrote:
>
> This is the second time after a year that the server's disk throws
> "INPUT OUTPUT ERROR" and "btrfs scrub" finds some uncorrectable errors
> along with some corrected errors. However, "smartctl -x" displays
> "SMART overall-health self-assessment test result: PASSED".
>
> Should we interpret "btrfs scrub"'s "uncorrectable error count" as
> "time to replace the disk" or are those unrelated events?
>
> Thanks in advance.

The paper linked below is a bit old, and there are more recent papers
on better approaches. But on the question of whether SMART attributes
alone correlate with failures, it demonstrates there's a big window
where failures can happen and SMART gives no advance warning.
https://www.usenix.org/legacy/events/fast07/tech/full_papers/pinheiro/pinheiro_old.pdf

If you are running 'smartctl -t long', or have smartd configured to
run the long test periodically, and that test never shows a failure,
that means the drive thinks it's doing a good job :D

If you assume the drive's error detection is working, then no errors
detected by the drive means the data on the drive is the data the
drive computed its checksum on. That leaves the drive's own
controller, its memory cache, and everything before that (connectors,
cables, logic board controller, logic board RAM; probably not CPU
memory or the CPU itself, or you'd have a ton of problems) as places
where the data could get corrupted. Btrfs can detect that kind of
corruption, while the drive firmware will assume the data is correct.
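
To make that reasoning concrete, here is a toy sketch in Python. It is
only a model, not how any real drive or Btrfs is implemented: the
drive_write/drive_read/flip_bit helpers are hypothetical, and
zlib.crc32 stands in for both the drive's internal error detection and
the filesystem checksum. The point is that a checksum computed after
the corruption cannot see it, while one computed before can.

```python
import zlib


def drive_write(payload: bytes) -> tuple[bytes, int]:
    """Toy drive: store whatever bytes arrive and compute a check code on them."""
    return payload, zlib.crc32(payload)


def drive_read(stored: tuple[bytes, int]) -> tuple[bytes, bool]:
    """Toy drive read: return the data and whether the drive's own check passed."""
    data, code = stored
    return data, zlib.crc32(data) == code


def flip_bit(data: bytes, bit: int) -> bytes:
    """Simulate corruption in transit by flipping a single bit."""
    buf = bytearray(data)
    buf[bit // 8] ^= 1 << (bit % 8)
    return bytes(buf)


original = b"block of file data as the filesystem intended it"

# The filesystem computes its checksum before the data leaves for the drive.
fs_checksum = zlib.crc32(original)

# Corruption happens somewhere before the drive computes its own check code
# (cable, controller, cache). This is the assumed failure for the sketch.
in_transit = flip_bit(original, 13)
stored = drive_write(in_transit)

data, drive_ok = drive_read(stored)
fs_ok = zlib.crc32(data) == fs_checksum

print("drive self-check passed:", drive_ok)    # True: the drive sees nothing wrong
print("filesystem checksum matched:", fs_ok)   # False: the filesystem catches it
```

In this model the drive faithfully stores and returns corrupted data,
so a drive-side test passes while the filesystem-side check fails,
which is the same shape as a clean SMART result next to scrub errors.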

-- 
Chris Murphy


