Re: Adventures in btrfs raid5 disk recovery

On Mon, Jun 27, 2016 at 3:57 PM, Zygo Blaxell
<ce3g8jdj@xxxxxxxxxxxxxxxxxxxxx> wrote:
> On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:

>
>> It just came up again in a thread over the weekend on linux-raid@. I'm
>> going to ask while people are paying attention if a patch to change
>> the 30 second time out to something a lot higher has ever been
>> floated, what the negatives might be, and where to get this fixed if
>> it wouldn't be accepted in the kernel code directly.
>
> Defaults are defaults, they're not for everyone.  30 seconds is about
> two minutes too short for an SMR drive's worst-case write latency, or
> 28 seconds too long for an OLTP system, or just right for an end-user's
> personal machine with a low-energy desktop drive and a long spin-up time.

The question is where the correct place is to change the default so that
it broadly captures most use cases, because the current 30-second default
is definitely incompatible with consumer SATA drives, whether they're in
an enclosure or not.

Maybe it's with the kernel teams at each distribution? Or maybe an
upstream udev rule?
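
For the udev route, I'm picturing something like this (just a rough
sketch, not a tested rule; the file name, the 180 second value, and the
rotational match are all my assumptions):

  # /etc/udev/rules.d/60-block-scsi-timeout.rules  (hypothetical name)
  # Raise the SCSI command timer for rotational SATA disks so a consumer
  # drive stuck in deep recovery hits its own internal limit before the
  # kernel gives up and resets the link.
  ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{device/timeout}="180"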

In any case, something needs to give here: we've been bugging users about
this misconfiguration for years, and people constantly run into it, which
means user education is not working.


>
> Once a drive starts taking 30+ seconds to do I/O, I consider the drive
> failed in the sense that it's too slow to meet latency requirements.

Well, that is then a mismatch between the use case and the drive
purchasing decision. Consumer drives do this; it's how they're designed to work.
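
For what it's worth, on drives that do expose SCT ERC, the usual
mitigation is to cap the drive's own recovery time, roughly like this
(/dev/sda and the 7 second cap are just example values):

  # Check whether the drive supports SCT Error Recovery Control:
  smartctl -l scterc /dev/sda
  # Cap read/write recovery at 7.0 seconds (units are 100 ms), so the drive
  # gives up and reports a read error before the kernel's command timer fires:
  smartctl -l scterc,70,70 /dev/sda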


> When the problem is that it's already taking too long, the solution is
> not waiting even longer.  To put things in perspective, consider that
> server hardware watchdog timeouts are typically 60 seconds by default
> (if not maximum).

If you want the data retrieved from that particular device, the only
solution is waiting longer. The alternative is what you get today: an I/O
error (well, actually a link reset, which also means the entire command
queue is purged on SATA drives).


> If anything, I want the timeout to be shorter so that upper layers with
> redundancy can get an EIO and initiate repair promptly, and admins can
> get notified to evict chronic offenders from their drive slots, without
> having to pay extra for hard disk firmware with that feature.

The drive totally thwarts this. It doesn't report back to the kernel
which command is hung, as far as I'm aware. It just hangs and goes into
so-called "deep recovery". There is no way to know which sector is
causing the problem until the drive reports a read error, which will
include the affected sector's LBA.

Btrfs does have something of a workaround for when things get slow,
and that's balance: read and rewrite everything. The rewrites force the
drive firmware to remap bad sectors.
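
Roughly what I mean, with the mount point as a placeholder (and note
that scrub only repairs where a second copy or parity exists):

  # Rewrite every allocated block group; the writes give the drive firmware
  # a chance to remap weak or pending sectors.
  # (On older btrfs-progs, plain "btrfs balance start /mnt/data" is a full balance.)
  btrfs balance start --full-balance /mnt/data
  # With a redundant profile, scrub will also rewrite anything that fails
  # its checksum, using the good copy:
  btrfs scrub start -Bd /mnt/data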


>> *Ideally* I think we'd want two timeouts. I'd like to see commands
>> have a timer that results in merely a warning that could be used by
>> e.g. btrfs scrub to know "hey this sector range is 'slow' I'm going to
>> write over those sectors". That's how bad sectors start out, they read
>> slower and eventually go beyond 30 seconds and now it's all link
>> resets. If the problem could be fixed before then... that's the best
>> scenario.
>
> What's the downside of a link reset?  Can the driver not just return
> EIO for all the outstanding IOs in progress at reset, and let the upper
> layers deal with it?  Or is the problem that the upper layers are all
> horribly broken by EIOs, or drive firmware horribly broken by link resets?

A link reset clears the entire command queue on SATA drives, and it
wipes away any possibility of finding out which LBA, or even which range
of LBAs, is the source of the stall. So it pretty much gets you nothing.


> The upper layers could time the IOs, and make their own decisions based
> on the timing (e.g. btrfs or mdadm could proactively repair anything that
> took more than 10 seconds to read).  That might be a better approach,
> since shortening the time to an EIO is only useful when you have a
> redundancy layer in place to do something about them.

For RAID with redundancy, that's doable, although I have no idea what
work would be needed, or even whether it's possible, to track commands in
this manner and fall back to some kind of repair mode as if the slow
command were a read error.

For single drives and RAID 0, the only possible solution is to hold off
on link resets for up to three minutes and hope the drive eventually
returns the single copy of the data.
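
In practice that means raising the SCSI command timer per device,
something like this (180 seconds is my guess at a worst case, and it
does not persist across reboots):

  # The kernel default is 30 seconds:
  cat /sys/block/sda/device/timeout
  # Give a consumer drive up to three minutes to finish its own recovery:
  echo 180 > /sys/block/sda/device/timeout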

Even in the case of Btrfs DUP, repair is thwarted unless the drive
reports a read error (or returns bad data).



-- 
Chris Murphy