Re: Adventures in btrfs raid5 disk recovery

On Mon, Jun 27, 2016 at 04:30:23PM -0600, Chris Murphy wrote:
> On Mon, Jun 27, 2016 at 3:57 PM, Zygo Blaxell
> <ce3g8jdj@xxxxxxxxxxxxxxxxxxxxx> wrote:
> > On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:
> > If anything, I want the timeout to be shorter so that upper layers with
> > redundancy can get an EIO and initiate repair promptly, and admins can
> > get notified to evict chronic offenders from their drive slots, without
> > having to pay extra for hard disk firmware with that feature.
> 
> The drive totally thwarts this. It doesn't report back to the kernel
> which command is hung, as far as I'm aware. It just hangs and goes into
> a so-called "deep recovery"; there is no way to know which sector is
> causing the problem

I'm proposing we just treat the link reset _as_ an EIO, unless
transparent link resets are required for link-speed negotiation or
something.  The drive wouldn't be thwarting anything; the host would
simply ignore it (unless the drive doesn't respond to a link reset until
after its internal timeout, in which case nothing is saved by shortening
the timeout).
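
For what it's worth, the knob that exists today is the SCSI command
timer in sysfs.  A minimal sketch of shortening it in Python (the
device name and value are examples, and this assumes a SCSI/SATA disk):

    # Shorten the kernel's command timer so a hung request is errored
    # out well before the drive's multi-minute internal recovery.
    def set_scsi_timeout(dev: str, seconds: int) -> None:
        with open(f"/sys/block/{dev}/device/timeout", "w") as f:
            f.write(str(seconds))

    set_scsi_timeout("sda", 10)  # hypothetical device and timeout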

> until the drive reports a read error, which will
> include the affected sector LBA.

It doesn't matter which sector.  Chances are good that it was more than
one of the outstanding requested sectors anyway.  Rewrite them all.

We know which sectors they are because somebody has an IO operation
waiting for a status on each of them (unless they're using AIO or some
other API where a request can be fired at a hard drive and the reply
discarded).  Notify all of them that their IO failed and move on.

> Btrfs does have something of a workaround for when things get slow,
> and that's balance: read and rewrite everything. The write forces
> sector remapping by the drive firmware for bad sectors.

It's a crude form of "resilvering" as ZFS calls it.
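
For reference, kicking that off by hand is a single command; a sketch
of wrapping it (the mountpoint is an example, and --full-balance just
makes the "rewrite everything" intent explicit):

    import subprocess

    # Read and rewrite every block group so the drive firmware gets
    # a chance to remap any marginal sectors along the way.
    subprocess.run(["btrfs", "balance", "start", "--full-balance",
                    "/mnt"], check=True)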

> > The upper layers could time the IOs, and make their own decisions based
> > on the timing (e.g. btrfs or mdadm could proactively repair anything that
> > took more than 10 seconds to read).  That might be a better approach,
> > since shortening the time to an EIO is only useful when you have a
> > redundancy layer in place to do something about them.
> 
> For RAID with redundancy, that's doable, although I have no idea what
> work is needed, or even if it's possible, to track commands in this
> manner, and fall back to some kind of repair mode as if it were a read
> error.

If btrfs sees an EIO from a lower block layer, it will try to
reconstruct the missing data (but not repair it).  If that happens
during a scrub, it will also attempt to rewrite the reconstructed data
over the original offending sectors.  This happens every few months in
my server pool, and seems to be working even on btrfs raid5.
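
A userspace approximation of the "time the IOs" idea quoted above
might look like this (the threshold, the mountpoint, and using scrub
as the repair trigger are all assumptions for illustration):

    import os, subprocess, time

    SLOW_SECONDS = 10  # the 10-second threshold suggested above

    def timed_read(fd: int, offset: int, length: int) -> bytes:
        t0 = time.monotonic()
        data = os.pread(fd, length, offset)
        if time.monotonic() - t0 > SLOW_SECONDS:
            # Suspiciously slow read: have btrfs verify checksums
            # and rewrite any bad copies from redundancy.
            subprocess.run(["btrfs", "scrub", "start", "/mnt"])
        return data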

Last time I checked, all the RAID implementations on Linux (OK, so
that's pretty much just md-raid) had some sort of repair capability.
lvm uses (or can use) the md-raid implementation.  ext4 and xfs on
naked disk partitions will have problems, but that's because they were
designed in the 1990s, when we were young and naive and still believed
hard disks would one day become reliable devices without buggy firmware.
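
For md-raid specifically, the repair pass is already exposed through
sysfs; a minimal sketch (the array name is an example):

    # Ask md to read the whole array and rewrite anything unreadable
    # or mismatched from the redundant copies.
    def md_repair(md_dev: str) -> None:
        with open(f"/sys/block/{md_dev}/md/sync_action", "w") as f:
            f.write("repair")

    md_repair("md0")  # hypothetical array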

> For single drives and RAID 0, the only possible solution is to not do
> link resets for up to 3 minutes and hope the drive returns the single
> copy of data.

So perhaps the timeout should be influenced by higher layers, e.g. if a
disk becomes part of a raid1, its timeout should be shortened by default,
while the timeout for a disk that is not used by a redundant layer should
be longer.
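
A sketch of that policy (the numbers are arbitrary, and how you detect
that a device backs a redundant layer is left out as an assumption):

    def pick_timeout(redundant: bool) -> int:
        # Fail fast when a redundant layer can repair the data;
        # otherwise give the drive time for its internal recovery.
        return 10 if redundant else 180

    def apply_timeout(dev: str, redundant: bool) -> None:
        with open(f"/sys/block/{dev}/device/timeout", "w") as f:
            f.write(str(pick_timeout(redundant)))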

> Even in the case of Btrfs DUP, it's thwarted without a read error
> reported from the drive (or it returning bad data).

That case gets messy--different timeouts for different parts of the disk.
Probably not practical.
