Re: Adventures in btrfs raid5 disk recovery

On 29/06/16 04:01, Chris Murphy wrote:
> Just wiping the slate clean to summarize:
> 
> 
> 1. We have a consistent ~1 in 3 maybe 1 in 2, reproducible corruption
> of *data extent* parity during a scrub with raid5. Goffredo and I have
> both reproduced it. It's a big bug. It might still be useful if
> someone else can reproduce it too.
> 
> Goffredo, can you file a bug at bugzilla.kernel.org and reference your
> bug thread?  I don't know if the key developers know about this, it
> might be worth pinging them on IRC once the bug is filed.
> 
> Unknown if it affects balance, or raid 6. And if it affects raid 6, is
> p or q corrupted, or both? Unknown how this manifests on metadata
> raid5 profile (only tested was data raid5). Presumably if there is
> metadata corruption that's fixed during a scrub, and its parity is
> overwritten with corrupt parity, the next time there's a degraded
> state, the file system would face plant somehow. And we've seen quite
> a few degraded raid5's (and even 6's) face plant in inexplicable ways
> and we just kinda go, shit. Which is what the fs is doing when it
> encounters a pile of csum errors. It treats the csum errors as a
> signal to disregard the fs rather than maybe only being suspicious of
> the fs. Could it turn out that these file systems were recoverable,
> just that Btrfs wasn't tolerating any csum error and wouldn't proceed
> further?
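
To make the failure mode above concrete, here is a minimal sketch of single-parity (XOR) reconstruction. It is plain Python over byte strings, not btrfs code, and every name in it (xor_blocks, d0, d1, bad_parity) is made up for illustration. It assumes a stripe of two data blocks plus one parity block, and shows that a corrupt parity block does no harm while all devices are present but produces wrong data once a block has to be rebuilt from it.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# One stripe: two data blocks and their parity (hypothetical contents).
d0 = b"hello-raid5-d0!!"
d1 = b"other-data-blk-1"
parity = xor_blocks(d0, d1)

# Healthy parity: if d1 is lost, d0 XOR parity gives it back.
assert xor_blocks(d0, parity) == d1

# Simulate parity that was rewritten incorrectly (first byte flipped).
bad_parity = bytes([parity[0] ^ 0xFF]) + parity[1:]

# d0 and d1 are untouched, so normal reads still pass their checksums.
# Only a degraded read, which must rebuild d1 from d0 and the parity,
# sees the damage.
rebuilt_d1 = xor_blocks(d0, bad_parity)
print("rebuilt block matches original:", rebuilt_d1 == d1)
```

The point of the sketch is only that the damage stays invisible until the array is degraded, which fits with arrays that looked fine and then threw a pile of csum errors after a disk was lost.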

I believe RAID6 behaves the same way, based on my experience. I
actually wondered whether the system halts were caused by the sheer
volume of csum errors rather than by the errors themselves. Just about
every hang, where CPU usage went to 100% on all cores and the system
just stopped, came after a flood of csum errors. If there were only one
or two (or I copied data off via a network connection where the read
rate was slower), I found I had a MUCH lower chance of the system
locking up.

In fact, now that I think about it, when I was copying data to an
external USB drive (maxed out at ~30MB/sec), I still got csum errors -
but the system never hung.

Every crash ended with a last line along the lines of "Stopped
recurring error. Your system needs rebooting". I wonder whether, if
this error reporting were altered, the system would stay up.

Of course, I have no way of testing this.


-- 
Steven Haigh

Email: netwiz@xxxxxxxxx
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897


