Re: Adventures in btrfs raid5 disk recovery

On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn
<ahferroin7@xxxxxxxxx> wrote:
> On 2016-06-25 12:44, Chris Murphy wrote:
>>
>> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
>> <ahferroin7@xxxxxxxxx> wrote:
>>
>>> Well, the obvious major advantage that comes to mind for me for
>>> checksumming parity is that it would let us scrub the parity data
>>> itself and verify it.
>>
>>
>> OK but hold on. During scrub, it should read data, compute checksums
>> *and* parity, and compare those to what's on disk -> EXTENT_CSUM in
>> the checksum tree, and the parity strip in the chunk tree. And if the
>> parity is wrong, then it should be replaced.
>
> Except that's horribly inefficient.  With limited exceptions involving
> highly situational co-processors, computing a checksum of a parity block is
> always going to be faster than computing parity for the stripe.  By using
> that to check parity, we can safely speed up the common case of near zero
> errors during a scrub by a pretty significant factor.

OK, I'm in favor of that. Although somehow md gets away with computing
and checking parity on every scrub, and still manages to keep drives
saturated in the process - at least HDDs; I'm not sure how it fares on
SSDs.
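
For reference, the parity an md "check" pass recomputes is just the XOR
across the data strips of each stripe, so it has to read and fold every
data strip, where a csum check would only hash the one parity strip. A
toy per-stripe version of that comparison (made-up sizes, nothing like
md's actual code):

    /* md-style check, per stripe: recompute parity as the XOR of all
     * data strips and compare it to the parity strip on disk. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define STRIP 8            /* bytes per strip, tiny for the demo */
    #define NDATA 3            /* data strips per stripe */

    static int parity_matches(uint8_t data[NDATA][STRIP],
                              const uint8_t parity[STRIP])
    {
        uint8_t computed[STRIP];
        memset(computed, 0, sizeof(computed));
        for (int d = 0; d < NDATA; d++)      /* XOR across data strips */
            for (int i = 0; i < STRIP; i++)
                computed[i] ^= data[d][i];
        return memcmp(computed, parity, STRIP) == 0;
    }

    int main(void)
    {
        uint8_t data[NDATA][STRIP] = { "aaaaaaa", "bbbbbbb", "ccccccc" };
        uint8_t parity[STRIP];
        for (int i = 0; i < STRIP; i++)
            parity[i] = data[0][i] ^ data[1][i] ^ data[2][i];

        printf("clean: %s\n", parity_matches(data, parity) ? "ok" : "mismatch");
        parity[0] ^= 0xff;                   /* corrupt one parity byte */
        printf("bad:   %s\n", parity_matches(data, parity) ? "ok" : "mismatch");
        return 0;
    }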




> The ideal situation that I'd like to see for scrub WRT parity is:
> 1. Store checksums for the parity itself.
> 2. During scrub, if the checksum is good, the parity is good, and we just
> saved the time of computing the whole parity block.
> 3. If the checksum is not good, then compute the parity.  If the parity
> just computed matches what's already there, the checksum itself is bad
> and should be rewritten (and we should probably recompute the whole
> block of checksums it's in); otherwise the parity was bad, so write out
> the new parity and update the checksum.
> 4. Have an option to skip the csum check on the parity and always compute
> it.
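
Just to make sure I'm reading steps 1-4 the same way you wrote them,
here's how I'd sketch that decision flow. Every name here is a
hypothetical stand-in, not a btrfs internal, and the "checksum" is a
toy XOR fold standing in for crc32c:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define STRIP 8                 /* bytes per strip, demo-sized */
    #define NDATA 3                 /* data strips per stripe */

    static uint32_t toy_csum(const uint8_t *buf, size_t len)
    {
        uint32_t c = 0;
        for (size_t i = 0; i < len; i++)
            c = (c << 3) ^ (c >> 29) ^ buf[i];  /* stand-in for crc32c */
        return c;
    }

    static void compute_parity(uint8_t data[NDATA][STRIP], uint8_t out[STRIP])
    {
        memset(out, 0, STRIP);
        for (int d = 0; d < NDATA; d++)
            for (int i = 0; i < STRIP; i++)
                out[i] ^= data[d][i];
    }

    static const char *scrub_parity(uint8_t data[NDATA][STRIP],
                                    uint8_t parity[STRIP],
                                    uint32_t *stored_csum,
                                    int skip_csum_check)    /* step 4 */
    {
        uint8_t computed[STRIP];

        /* Step 2: csum matches -> trust parity, skip recomputing it. */
        if (!skip_csum_check && toy_csum(parity, STRIP) == *stored_csum)
            return "parity ok (fast path)";

        /* Step 3: csum missed (or check skipped): recompute parity. */
        compute_parity(data, computed);
        if (memcmp(computed, parity, STRIP) == 0) {
            *stored_csum = toy_csum(parity, STRIP); /* csum was the bad one */
            return "csum rewritten";
        }
        memcpy(parity, computed, STRIP);            /* parity was bad */
        *stored_csum = toy_csum(parity, STRIP);
        return "parity rewritten";
    }

    int main(void)
    {
        uint8_t data[NDATA][STRIP] = { "aaaaaaa", "bbbbbbb", "ccccccc" };
        uint8_t parity[STRIP];
        compute_parity(data, parity);
        uint32_t csum = toy_csum(parity, STRIP);    /* step 1: store csum */

        printf("%s\n", scrub_parity(data, parity, &csum, 0));
        csum ^= 1;                                  /* stale checksum */
        printf("%s\n", scrub_parity(data, parity, &csum, 0));
        parity[0] ^= 0xff;                          /* corrupted parity */
        printf("%s\n", scrub_parity(data, parity, &csum, 0));
        return 0;
    }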
>>
>>
>> Even "echo check > md/sync_action" does this. So, no pun intended,
>> but Btrfs isn't even at parity with mdadm on data integrity if it
>> doesn't check whether the parity matches the data.
>
> Except that MD and LVM don't have checksums to verify anything outside of
> the very high-level metadata.  They have to compute the parity during a
> scrub because that's the _only_ way they have to check data integrity.  Just
> because that's the only way for them to check it does not mean we have to
> follow their design, especially considering that we have other, faster ways
> to check it.

I'm not opposed to this optimization. But to better qualify my earlier
"major advantage" remark: what I meant was a major advantage in terms
of fixing a functional deficiency, not performance.



>> The much bigger problem we have right now, affecting Btrfs and
>> LVM/mdadm md raid alike, is this silly bad default: non-enterprise
>> drives with no configurable SCT ERC, and therefore long internal
>> recovery times, combined with the kernel SCSI command timer at 30
>> seconds. That actually fucks over regular single-disk users too,
>> because it means they don't get the "benefit" of long recovery times,
>> which is the whole g'd point of that feature. This by itself causes
>> so many problems where bad sectors just get worse and never get fixed
>> up because of all the link resets. So I still think it's a bullshit
>> default on the kernel side, because it affects the majority use case;
>> it's only a non-problem with proprietary hardware raid, and with
>> software raid using enterprise (or NAS-specific) drives that already
>> have short recovery times by default.
>
> On this, we can agree.

It just came up again in a thread over the weekend on linux-raid@. I'm
going to ask, while people are paying attention, whether a patch to
change the 30-second timeout to something a lot higher has ever been
floated, what the negatives might be, and where to get this fixed if it
wouldn't be accepted in the kernel code directly.
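
For anyone who wants to work around it in the meantime, the two knobs
involved are the drive's SCT ERC setting (on drives that support it at
all) and the kernel's per-device command timer; illustrative values,
not recommendations:

    # cap the drive's error recovery at 7.0 seconds (units are 100 ms)
    smartctl -l scterc,70,70 /dev/sdX

    # or raise the kernel's command timer (seconds) above the drive's
    # worst-case internal recovery time
    echo 180 > /sys/block/sdX/device/timeout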

*Ideally* I think we'd want two timeouts. I'd like to see commands have
a timer that results in merely a warning, which something like btrfs
scrub could use to decide "hey, this sector range is slow, I'm going to
write over those sectors." That's how bad sectors start out: they read
slower and slower, eventually go past 30 seconds, and then it's all
link resets. If the problem could be fixed before that point, that's
the best scenario.

The second timer would be for when the controller or drive has just
face-planted: reset.
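
Something like this, in toy form - made-up thresholds, and obviously
nothing like how the SCSI midlayer would actually wire it up:

    #include <stdio.h>

    #define SLOW_WARN_SECS   5.0   /* hypothetical: warn, don't reset */
    #define HARD_RESET_SECS 120.0  /* hypothetical: give up, reset link */

    /* Classify one command's latency against the two thresholds. */
    static void check_latency(double elapsed_secs, long long sector)
    {
        if (elapsed_secs >= HARD_RESET_SECS)
            printf("sector %lld: timed out, resetting link\n", sector);
        else if (elapsed_secs >= SLOW_WARN_SECS)
            /* A scrubber could watch for these and rewrite the range. */
            printf("sector %lld: slow read (%.1fs), rewrite candidate\n",
                   sector, elapsed_secs);
    }

    int main(void)
    {
        check_latency(0.02, 1000);    /* healthy */
        check_latency(9.5, 2048);     /* slow - warn only */
        check_latency(130.0, 4096);   /* face-planted - reset */
        return 0;
    }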

-- 
Chris Murphy