Re: uncorrectable errors after btrfs replace

On Aug 25, 2013, at 4:10 PM, Stuart Pook <slp644161@xxxxxxx> wrote:
> 
> I emailed them to Stefan Behrens & Chris Murphy.  Please let me know if you did not get them (presumably because they are too big).

Observations:

1. The problems started before the start of the provided log.

2. smartd reports sdb at 100°C. The spec sheet for the WD2002FAEX says 60°C. It's possible the raw value isn't actually °C, so you'll need to look at the VALUE, WORST and THRESH columns of smartctl -a to determine whether it is at, or has ever hit, the threshold. It seems possible the drives are being cooked.

sdc is an ST2000DL004, for which Google finds this:
http://forums.seagate.com/t5/Desktop-HDD-Desktop-SSHD/BEWARE-the-so-called-Samsung-HD204UI/m-p/166856

It also looks to be running hot. 
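
To check whether a temperature attribute has hit its threshold, here is a minimal sketch that filters smartctl -A output. It assumes the usual attribute table layout (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE) and that the attribute name contains "Temperature". check_temp is a hypothetical helper name, not part of smartmontools.

```shell
# Read smartctl -A output on stdin and report the temperature attributes.
# A normalized VALUE or WORST at or below a non-zero THRESH means the
# attribute is failing now or has failed in the past.
check_temp() {
    awk '$2 ~ /Temperature/ {
        status = "ok"
        if ($6 + 0 > 0 && ($4 + 0 <= $6 + 0 || $5 + 0 <= $6 + 0))
            status = "AT_OR_BELOW_THRESH"
        printf "%s value=%d worst=%d thresh=%d raw=%s %s\n", $2, $4, $5, $6, $10, status
    }'
}

# Usage (needs root; sdb is the drive from the log):
#   smartctl -A /dev/sdb | check_temp
```

A THRESH of 0 means the drive never flags that attribute as failed, so in that case the raw value and the spec sheet are all there is to go on.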

3. The first ata error seems to be 8b/10b encoding related, which could be a connector problem, a port problem, a drive problem, or a firmware bug. Emask 0x10 is bit 4, which libata.h defines as an ATA bus error; the NCQ marker is a different bit (0x400):
AC_ERR_ATA_BUS          = (1 << 4), /* ATA bus error */
AC_ERR_NCQ              = (1 << 10), /* marker for offending NCQ qc */
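
The Emask in a libata error line is a bitmask of the AC_ERR_* flags, so it can be decoded bit by bit. A minimal sketch, assuming the bit order of the AC_ERR_* enum in include/linux/libata.h (bits 0 through 10); decode_emask is a hypothetical helper name:

```shell
# Print the AC_ERR_* flag names set in a libata Emask value.
# The names are listed in bit order, bit 0 first.
decode_emask() {
    mask=$(( $1 ))   # accepts hex such as 0x10
    bit=0
    out=""
    for name in AC_ERR_DEV AC_ERR_HSM AC_ERR_TIMEOUT AC_ERR_MEDIA \
                AC_ERR_ATA_BUS AC_ERR_HOST_BUS AC_ERR_SYSTEM AC_ERR_INVALID \
                AC_ERR_OTHER AC_ERR_NODEV_HINT AC_ERR_NCQ; do
        if [ $(( mask & (1 << $bit) )) -ne 0 ]; then
            out="${out:+$out }$name"
        fi
        bit=$(( bit + 1 ))
    done
    echo "${out:-none}"
}

# Usage:
#   decode_emask 0x10
```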

4. Hundreds of these:
ata10.00: failed command: READ FPDMA QUEUED

READ FPDMA QUEUED is an NCQ command, so this implies there may be an incompatibility between this drive and the controller. Disabling NCQ on the drive (setting its queue depth to 1) may fix the problem.
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/550559

https://ata.wiki.kernel.org/index.php/Libata_FAQ

echo 1 > /sys/block/sdX/device/queue_depth


I can't tell you which /dev/ node corresponds to ata10.00 because the log is incomplete, so I don't know which drive is giving you a hard time with NCQ. The catch is that if you disable NCQ on just one drive, it will be slower than the others, and I don't know how tolerant btrfs is of devices with different speeds.
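
On the running system the port-to-device mapping can usually be read from sysfs. A minimal sketch, assuming a kernel where the resolved path of /sys/block/sdX contains an ataN component (true of kernels with the libata transport class, but check yours); devs_on_ata_port is a hypothetical helper name, and the second argument exists only so it can be pointed at a test directory:

```shell
# Print the /dev/sdX nodes whose sysfs device path runs through the
# given ATA port (e.g. ata10 for the ata10.00 messages in dmesg).
devs_on_ata_port() {
    port=$1
    root=${2:-/sys}
    for link in "$root"/block/sd*; do
        [ -e "$link" ] || continue
        case "$(readlink -f "$link")" in
            */"$port"/*) echo "/dev/${link##*/}" ;;
        esac
    done
}

# Usage:
#   devs_on_ata_port ata10
#   cat /sys/block/sdX/device/queue_depth   # current queue depth for that drive
```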



5. Tens of thousands of checksum errors on both dm-11 and dm-12. 

6. Many instances of 
 btrfs: unable to fixup (regular) error at logical 53281xxxxxx on dev /dev/dm-11
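
To see how these are spread across devices, the messages can be tallied from the kernel log. A minimal sketch, assuming the message format shown above, with the device as the last field; count_unfixable is a hypothetical helper name:

```shell
# Count "unable to fixup" messages per device in kernel log text on stdin.
count_unfixable() {
    awk '/unable to fixup/ && /on dev/ {
        n[$NF]++
    } END {
        for (d in n) printf "%s %d\n", d, n[d]
    }' | sort
}

# Usage:
#   dmesg | count_unfixable
```

The kernel ring buffer is limited in size, so older messages may already have been dropped. Where the installed btrfs-progs supports it, btrfs device stats reports persistent per-device error counters.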

So kernel messages have been screaming about bus-related problems for some time, and they were ignored. Btrfs did what it could and reported hundreds to thousands of errors in dmesg, but the user space tools didn't warn the user that the operations had effectively failed.


Chris Murphy