Re: Is it logical to use a disk that scrub fails but smartctl succeeds?

On Wed, Dec 11, 2019 at 11:37 AM Chris Murphy <lists@xxxxxxxxxxxxxxxxx> wrote:
>
> On Wed, Dec 11, 2019 at 6:11 AM Cerem Cem ASLAN <ceremcem@xxxxxxxxxxxx> wrote:
> >
> > This is the second time after a year that the server's disk throws
> > "INPUT OUTPUT ERROR" and "btrfs scrub" finds some uncorrectable errors
> > along with some corrected errors. However, "smartctl -x" displays
> > "SMART overall-health self-assessment test result: PASSED".
> >
> > Should we interpret "btrfs scrub"'s "uncorrectable error count" as
> > "time to replace the disk" or are those unrelated events?
> >
> > Thanks in advance.
>
> This is a bit old, and there are more recent papers on better
> approaches. But as it relates to only SMART attributes correlating to
> failures, it demonstrates there's a big window where failures can
> happen and SMART gives no advance warning.
> https://www.usenix.org/legacy/events/fast07/tech/full_papers/pinheiro/pinheiro_old.pdf
>
> If you are doing 'smartctl -t long' or similarly have smartd
> configured to do the long test periodically, and if that test never
> shows a failure, that means the drive thinks it's doing a good job :D
> If you assume the drive's error detection is working, then no errors
> detected by the drive means the data on the drive is the data the
> drive computed the checksum on. That leaves the drive's own
> controller, memory cache, and everything before that (connectors,
> cables, logic board controller, logic board RAM, probably not CPU
> memory or the CPU itself or you'd have a ton of problems) which could
> contribute to corruption of the data that Btrfs could detect that the
> drive firmware will assume is correct.

The last sentence above is sloppy wording. The drive firmware doesn't
assume the data is correct; it computed its checksum over data that
was (likely) already corrupt when it arrived. The internal read-back
and error detection, which rely on that checksum stored with the
sector, therefore report the data as good. Ergo, the drive has no way
of knowing the data is bad.
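
To make that concrete, here is a toy sketch in Python. It is not how
drive firmware or Btrfs is actually implemented: zlib.crc32 stands in
for both the drive's internal ECC and the filesystem's checksum, and
the payload and the flip_bit helper are made up for illustration. The
only point is where each checksum gets computed relative to the
corruption.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted (simulated corruption)."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


# The filesystem checksums the block before it leaves host memory.
original = b"filesystem block payload"
fs_checksum = zlib.crc32(original)

# Assumption: one bit is flipped somewhere between the host and the
# platter (cable, controller, drive cache), before the drive checksums it.
received = flip_bit(original, 5)

# The drive computes its own checksum over what it actually received.
drive_checksum = zlib.crc32(received)

# Later, the block is read back unchanged from the medium.
read_back = received
drive_says_ok = zlib.crc32(read_back) == drive_checksum  # internal check
fs_says_ok = zlib.crc32(read_back) == fs_checksum        # end-to-end check

print("drive internal check passes:", drive_says_ok)
print("filesystem check passes:", fs_says_ok)
```

The drive's check passes because it is comparing the corrupt data
against a checksum of that same corrupt data; only the checksum taken
before the corruption can catch it.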

There is error detection (CRC) on the link between the logic board
controller and the controller in the drive, because connectors and
cables are a known source of problems. These errors may or may not be
recorded and reported by SMART in attribute 199. They also may or may
not get reported to the kernel (they really should be, and probably
usually are). So if you see any of those, there's a small but non-zero
chance that some errors collide with a valid CRC and go undetected.
This error detection is really a low bar: it's not intended to
compensate for regular errors induced by a bad cable or connector,
it's designed to be a red flag to take action.
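
If you want to keep an eye on attribute 199, something along these
lines could pull its raw value out of the attribute table that
smartctl prints. This is only a sketch: the function name and the
SAMPLE text are made up for illustration (it is not output from the
drive in this thread), and it assumes the usual ten-column ATA
attribute table with the raw value in the last column. It matches on
the ID rather than the name, since vendors label this attribute
differently.

```python
from typing import Optional


def udma_crc_error_count(smartctl_output: str) -> Optional[int]:
    """Return the raw value of SMART attribute 199, or None if not listed."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID; the raw value is
        # assumed to be the tenth column.
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None


# Illustrative text only, shaped like a smartctl attribute table.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       8123
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       3
"""

count = udma_crc_error_count(SAMPLE)
if count:
    print("link CRC errors recorded:", count, "(check cable and connectors)")
```

In practice the text would come from running smartctl against the
device. A raw value that keeps climbing is the red flag; a value that
stays at zero does not prove the link is clean, for the reasons above.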

-- 
Chris Murphy


