Re: Is it logical to use a disk that scrub fails but smartctl succeeds?

Hi,

> But if there's a scant amount of minimum necessary metadata intact,
> you will get your data off, even if there are errors in the data
> (that's one of the options in restore is to ignore errors). Whereas
> normal operation Btrfs won't hand over anything with checksum errors,
> you get EIO instead. So there's a decent chance of getting the data
> off the drive this way.
>
> First order of priority is to get data off the drive, if you don't
> have a current backup.
> Second, once you have a backup or already do, demote this drive
> entirely, and start your restore from backup using good drives.

+

> That drive is toast.. the giveaway here is the over 1000 "Current
> Pending Sectors.".. there's no point trying to convert this drive to
> DUP,, it must simply be stopped, and what files you can successfully
> copy consider lucky.

Right after those comments I changed my priority to get the data off
to a reliable location (and not converting the profile to DUP) before
renewing the drives. Luckily, merging the good files from three
mirrored machines made it possible to recover nearly all data (all
important data except a few unimportant corrupted log files). Thanks
again and again for this valuable redirection.

> Oh and last but actually I should have mentioned it first, because
> you'll want to do this right away. You might check if this drive has
> configurable SCT ERC.
>
> smartctl -l scterc /dev/

SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled

It seems the drive supports SCT ERC, but it is disabled. However, smartctl throws a weird error even though the syntax looks correct:

=======> INVALID ARGUMENT TO -l: scterc,1800,70

Setting a long read-recovery window is an interesting approach. I'll
keep it in mind, even though this time I'm determined to build a
correct setup that makes such data-scraping jobs unnecessary.
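For the record, a minimal sketch of enabling SCT ERC with smartctl, assuming the usual deciseconds units and `/dev/sdX` as a placeholder for the real device:

```shell
# Set the error-recovery time limit to 7 seconds (70 deciseconds)
# for both reads and writes; /dev/sdX is a placeholder device.
smartctl -l scterc,70,70 /dev/sdX

# Verify the new setting took effect.
smartctl -l scterc /dev/sdX
```

Note the setting is typically not persistent across power cycles, so it usually has to be reapplied at boot (e.g. from a udev rule or startup script).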


> I wasn't clear on why your backup is supposed to be bad... BTRFS should have
> caught any errors during the backup and stopped things with I/O errors.

My strategy was to set up multiple machines that sync with each other
over the network. The database part was easy, since CouchDB has
synchronization out of the box. For the rest of the system (I'm
running one LXC container per service) I would run `btrfs send | btrfs
receive` every hour, rotating a single snapshot. I didn't set up a
RAID-1 profile because I thought it wasn't necessary in this context.
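The hourly rotation described above might look roughly like this sketch; the paths, snapshot names, and `backup-host` target are all hypothetical:

```shell
# Sketch: hourly incremental send/receive with a rotating snapshot pair.
# /srv/containers, the snapshot names, and backup-host are assumptions.
SRC=/srv/containers
SNAP_OLD=$SRC/.snap-prev
SNAP_NEW=$SRC/.snap-curr

# Take a new read-only snapshot of the subvolume.
btrfs subvolume snapshot -r "$SRC" "$SNAP_NEW"

# Send only the delta against the previous snapshot to the remote side.
btrfs send -p "$SNAP_OLD" "$SNAP_NEW" | ssh backup-host btrfs receive /mnt/backup

# Rotate: the new snapshot becomes the parent for the next run.
btrfs subvolume delete "$SNAP_OLD"
mv "$SNAP_NEW" "$SNAP_OLD"
```

The important detail is that `btrfs send` reads the snapshot's existing blocks as-is; it is the scrub, not the send, that verifies checksums across the whole filesystem.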

The first problem was that I "hoped" the machine would simply crash
with "DRDY ERR"s as soon as the disk had *any* problem; I was hoping
to be notified of the very first error by a total system failure.
Obviously it doesn't work like that. Neither the OS nor the
applications throw any error until something attempts to read or
write the corrupted file, so those corruptions happened unnoticed.
This was my mistake, and I learned that I should check the filesystem
with `btrfs scrub`. The "bad idea" was expecting an immediate
disk-failure notification in the form of a total system crash.
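The periodic check I was missing is just this (with `/mnt/data` as a placeholder mount point):

```shell
# Start a scrub in the background; it walks all data and metadata
# and verifies every checksum. /mnt/data is a placeholder.
btrfs scrub start /mnt/data

# Check progress and the error counters later.
btrfs scrub status /mnt/data
```

Running this from a weekly cron/systemd timer and alerting on a non-zero error count would have surfaced the corruption long before the restore became necessary.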

The second problem is that I mistakenly thought `btrfs sub snap` would
throw the same "Input/output error"s that `cp` does. It turns out this
is not the case, which is actually logical: if snapshotting verified
checksums, a snapshot operation would take far too long. I'm only
realizing that now.

After watching these corruption events, I still think I don't need a
RAID-1 setup to avoid losing data. However, a RAID-1 setup would
greatly shorten the recovery time of a problematic node.

So now I think the good idea is: use RAID-1, monitor disk health, and
be prepared to replace disks on the fly.
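The on-the-fly replacement part of that plan can be sketched like this; the device nodes and mount point are placeholders:

```shell
# Sketch: swap a failing device out of a mounted RAID-1 filesystem.
# /dev/sdb (failing), /dev/sdc (new) and /mnt/data are placeholders.
btrfs replace start /dev/sdb /dev/sdc /mnt/data
btrfs replace status /mnt/data

# Health signals worth wiring into monitoring:
smartctl -H /dev/sdc            # overall SMART health assessment
btrfs device stats /mnt/data    # per-device read/write/csum error counters
```

With RAID-1, `btrfs replace` rebuilds the new device from the surviving mirror while the filesystem stays online.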


