On Wed, Dec 11, 2019 at 6:11 AM Cerem Cem ASLAN <ceremcem@xxxxxxxxxxxx> wrote:
>
> This is the second time after a year that the server's disk throws
> "INPUT OUTPUT ERROR" and "btrfs scrub" finds some uncorrectable errors
> along with some corrected errors. However, "smartctl -x" displays
> "SMART overall-health self-assessment test result: PASSED".
>
> Should we interpret "btrfs scrub"'s "uncorrectable error count" as
> "time to replace the disk" or are those unrelated events?
>
> Thanks in advance.

The paper linked below is a bit old, and there are more recent papers
on better approaches. But on the narrow question of how well SMART
attributes correlate with failures, it demonstrates there's a big
window where failures can happen and SMART gives no advance warning.

https://www.usenix.org/legacy/events/fast07/tech/full_papers/pinheiro/pinheiro_old.pdf

If you are doing 'smartctl -t long', or similarly have smartd
configured to do the long test periodically, and that test never shows
a failure, that means the drive thinks it's doing a good job :D

If you assume the drive's error detection is working, then no errors
detected by the drive means the data on the drive is the data the
drive computed the checksum on. That leaves the drive's own
controller, its memory cache, and everything before that in the path
(connectors, cables, logic board controller, logic board RAM; probably
not CPU memory or the CPU itself, or you'd have a ton of problems) as
places where data could be corrupted in a way that Btrfs can detect
but that the drive firmware will assume is correct.

-- 
Chris Murphy
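
P.S. A rough sketch of the reasoning above, in case it helps. It assumes you feed it the text printed by 'btrfs device stats <mountpoint>', which keeps per-device counters for I/O errors and for errors Btrfs found through its own verification. The function names, the triage labels, and the sample values are made up for illustration; this is a rule of thumb, not an official diagnosis of any kind.

```python
import re

# Counter names as printed by `btrfs device stats`.
IO_COUNTERS = ("write_io_errs", "read_io_errs", "flush_io_errs")
VERIFY_COUNTERS = ("corruption_errs", "generation_errs")

# Matches lines such as "[/dev/sdb].corruption_errs  4".
_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<name>\w+)\s+(?P<value>\d+)$")


def parse_device_stats(text):
    """Return {device: {counter: value}} parsed from the stats text."""
    stats = {}
    for line in text.splitlines():
        m = _LINE.match(line.strip())
        if m:
            stats.setdefault(m["dev"], {})[m["name"]] = int(m["value"])
    return stats


def triage(counters):
    """Hypothetical rule of thumb based on which counters are non-zero."""
    io = sum(counters.get(name, 0) for name in IO_COUNTERS)
    verify = sum(counters.get(name, 0) for name in VERIFY_COUNTERS)
    if io:
        # The drive or the link reported failed reads, writes, or flushes.
        return "io-errors"
    if verify:
        # No I/O error was reported, yet Btrfs verification failed:
        # suspect the drive controller, its cache, cables, or board.
        return "silent-corruption"
    return "clean"


# Made-up sample values, only to exercise the functions.
SAMPLE = """\
[/dev/sdb].write_io_errs    0
[/dev/sdb].read_io_errs     0
[/dev/sdb].flush_io_errs    0
[/dev/sdb].corruption_errs  4
[/dev/sdb].generation_errs  0
"""

if __name__ == "__main__":
    for dev, counters in parse_device_stats(SAMPLE).items():
        print(f"{dev}: {triage(counters)}")
```

A "silent-corruption" result is the case described above: the drive believes the data is fine, so a SMART "PASSED" alongside it is not a contradiction.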
