Re: Uncorrectable errors on RAID-1?

On Sun, Dec 21, 2014 at 01:56:54PM -0800, Robert White wrote:
> On 12/21/2014 11:34 AM, constantine wrote:
> >Some months ago I had 6 uncorrectable errors. I deleted the files that
> >contained them and then after scrubbing I had 0 uncorrectable errors.
> >After some weeks I encountered new uncorrectable errors.
> >
> >Question 1:
> >Why do I have uncorrectable errors on a RAID-1 filesystem in the first place?
> 
> These are disk/platter/hardware errors. They happen for one of two
> reasons. (most likely) There is a flaw, new or existing, on the
> platter itself and data just cannot live in that spot. (least
> likely) You suffered an environmental hazard (hard jolt) while a
> sector was being written and the drive is just choking on the
> digital wreckage.
> 
> 
> >Question 2:
> >How do I properly correct them? (Again by deleting their files? :( )
> 
> You have to _force_ the system to write the sector. If the disk can
> correct the sector (not a hardware flaw) the problem goes away
> forever. If it can't the drive will re-map the sector with a spare
> sector and it will seem to go away forever.

   Note that one of the drives already has reallocated sectors, so
it's on its way to failing, and you should start saving up your
pennies for a new one now, even if it hasn't gone properly boom
yet. However, that doesn't explain on its own why you're getting
uncorrectable errors -- with RAID-1, the FS should be able to repair
a bad sector on one drive from the good copy on the other.

[snip]

> The good news is that since you are using RAID1 and checksums you
> shouldn't need to delete any files. Just coerce the write and then
> btrfs scrub your filesystem and the checksum/rewrite thing should
> recover the degraded copy from the good copy in the mirror.

   If btrfs detects a checksum error, it will try to fix it by reading
the other copy and then writing good data to the broken copy
again. You don't have to force a write to the FS in order to make it
fix broken data this way. A scrub will do this check-and-repair on all
content of the filesystem.

   If the FS is reporting uncorrectable errors, then it's tried both
copies and both fail their checksums. This is basically not fixable
without removing the files and replacing them with copies from your
backup. It's not obvious why you've got correlated errors on two
devices, though, and I'm not sure how to work it out.
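
   As a rough illustration of the logic in the two paragraphs above,
here is a toy model in Python. It is not btrfs code: zlib.crc32 is
only a stand-in for the real checksum, and the names checksum and
scrub_block are made up for the example.

```python
import zlib


def checksum(data: bytes) -> int:
    # Stand-in checksum for this toy model; not what btrfs actually uses.
    return zlib.crc32(data)


def scrub_block(copies: list, expected: int) -> str:
    """Check every mirrored copy of one block against its stored checksum.

    `copies` is a mutable list of bytes objects, one per mirror.
    Returns "ok", "corrected" or "uncorrectable".
    """
    good = [c for c in copies if checksum(c) == expected]
    if not good:
        # Every copy fails its checksum: nothing to repair from.
        return "uncorrectable"
    if len(good) == len(copies):
        return "ok"
    # At least one copy is good: overwrite each bad copy with it.
    for i, c in enumerate(copies):
        if checksum(c) != expected:
            copies[i] = good[0]
    return "corrected"
```

   With one bad copy the block comes back as "corrected" and both
copies match again; with both copies bad there is nothing left to
repair from, which is the "uncorrectable" case.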

   I'd suggest running the full SMART tests on the disks, and running
a scrub on the FS, and checking your logs for SATA errors and similar
problems.
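
   For what it's worth, those checks could be scripted along these
lines. This is only a sketch: the device names and mount point are
placeholders you would replace with your own, the helper name
diagnostic_commands is made up, and it only builds the command lines
rather than running them (they all need root).

```python
import shlex


def diagnostic_commands(devices, mountpoint):
    """Build (but do not run) the command lines for the suggested checks."""
    cmds = []
    for dev in devices:
        # Start the extended (long) SMART self-test; it runs in the
        # background on the drive and can take hours.
        cmds.append(["smartctl", "-t", "long", dev])
    for dev in devices:
        # Full SMART report; read it once the self-test has finished.
        cmds.append(["smartctl", "-a", dev])
    # Scrub the filesystem in the foreground (-B waits for completion).
    cmds.append(["btrfs", "scrub", "start", "-B", mountpoint])
    # Per-device error counters kept by btrfs.
    cmds.append(["btrfs", "device", "stats", mountpoint])
    # Kernel log, to be searched for SATA/ATA errors.
    cmds.append(["dmesg"])
    return cmds


if __name__ == "__main__":
    # Placeholder devices and mount point, assumed for illustration only.
    for cmd in diagnostic_commands(["/dev/sdX", "/dev/sdY"], "/mnt/data"):
        print(shlex.join(cmd))
```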

   Hugo.

[snip]

-- 
Hugo Mills             | I must be musical: I've got *loads* of CDs
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: 65E74AC0          |                                     Fran, Black Books
