On Fri, Nov 21, 2014 at 09:05:32AM +0200, Brendan Hide wrote:
> On 2014/11/21 06:58, Zygo Blaxell wrote:
> >You have one reallocated sector, so the drive has lost some data at some
> >time in the last 49000(!) hours.  Normally reallocations happen during
> >writes so the data that was "lost" was data you were in the process of
> >overwriting anyway; however, the reallocated sector count could also be
> >a sign of deteriorating drive integrity.
> >
> >In /var/lib/smartmontools there might be a csv file with logged error
> >attribute data that you could use to figure out whether that reallocation
> >was recent.
> >
> >I also notice you are not running regular SMART self-tests (e.g.
> >by smartctl -t long) and the last (and first, and only!) self-test the
> >drive ran was ~12000 hours ago.  That means most of your SMART data is
> >about 18 months old.  The drive won't know about sectors that went bad
> >in the last year and a half unless the host happens to stumble across
> >them during a read.
> >
> >The drive is over five years old in operating hours alone.  It is probably
> >so fragile now that it will break if you try to move it.
> All interesting points. Do you schedule SMART self-tests on your own
> systems? I have smartd running. In theory it tracks changes and
> sends alerts if it figures a drive is going to fail. But, based on
> what you've indicated, that isn't good enough.

I run 'smartctl -t long' from cron overnight (or whenever the drives
are most idle).  You can also set up smartd.conf to launch the self
tests; however, the syntax for test scheduling is byzantine compared to
cron (and that's saying something!).  On multi-drive systems I schedule
a different drive for each night.

If you are also doing btrfs scrub, then stagger the scheduling so
e.g. smart runs in even weeks and btrfs scrub runs in odd weeks.

smartd is OK for monitoring test logs and email alerts.  I've had no
problems there.

> >WARNING: errors detected during scrubbing, corrected.
> >[snip]
> >scrub device /dev/sdb2 (id 2) done
> >	scrub started at Tue Nov 18 03:22:58 2014 and finished after 2682 seconds
> >	total bytes scrubbed: 189.49GiB with 5420 errors
> >	error details: read=5 csum=5415
> >	corrected errors: 5420, uncorrectable errors: 0, unverified errors: 164
> >That seems a little off.  If there were 5 read errors, I'd expect the drive to
> >have errors in the SMART error log.
> >
> >Checksum errors could just as easily be a btrfs bug or a RAM/CPU problem.
> >There have been a number of fixes to csums in btrfs pulled into the kernel
> >recently, and I've retired two five-year-old computers this summer due
> >to RAM/CPU failures.
> The difference here is that the issue only affects the one drive.
> This leaves the probable cause at:
> - the drive itself
> - the cable/ports
>
> with a negligibly-possible cause at the motherboard chipset.

If it was cable, there should be UDMA CRC errors or similar in the SMART
counters, but they are zero.  You can also try swapping the cable and
seeing whether the errors move.  I've found many bad cables that way.

The drive itself could be failing in some way that prevents recording
SMART errors (e.g. because of host timeouts triggering a bus reset,
which also prevents the SMART counter update for what was going wrong at
the time).  This is unfortunately quite common, especially with drives
configured for non-RAID workloads.

>
> --
> __________
> Brendan Hide
> http://swiftspirit.co.za/
> http://www.webafrica.co.za/?AFF1E97
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
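
For what it's worth, the staggering described above (SMART long self-tests
in even weeks with a different drive each night, btrfs scrub in odd weeks)
could be driven from a single nightly cron entry with something like the
sketch below.  It only decides which command is due on a given date and
prints it.  The name nightly_job, the drive list, the choice of Monday as
the scrub night, and the "/" mount point are assumptions made up for
illustration; none of them come from this thread.

```python
import datetime


def nightly_job(day, drives, scrub_mount="/"):
    """Return the command (as an argv list) due on the night of `day`, or None.

    Assumed policy, for illustration only:
      - even ISO week: 'smartctl -t long' on one drive per night,
        Monday -> drives[0], Tuesday -> drives[1], and so on
      - odd ISO week: one 'btrfs scrub' on Monday night
    Only the first seven drives are covered by this simple mapping.
    """
    week = day.isocalendar()[1]   # ISO week number, 1..53
    weekday = day.weekday()       # Monday == 0 ... Sunday == 6

    if week % 2 == 0:
        # SMART week: test a different drive each night.
        if weekday < len(drives):
            return ["smartctl", "-t", "long", drives[weekday]]
        return None

    # Scrub week: run once, on Monday night; -B keeps scrub in the foreground.
    if weekday == 0:
        return ["btrfs", "scrub", "start", "-B", scrub_mount]
    return None


if __name__ == "__main__":
    # Hypothetical drive list; adjust to the actual system.
    job = nightly_job(datetime.date.today(), ["/dev/sda", "/dev/sdb"])
    print(" ".join(job) if job else "nothing scheduled tonight")
```

One caveat with week parity: a year that ends in ISO week 53 is followed
by week 1, so two odd (scrub) weeks land back to back at that boundary.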
