On 29/06/16 04:01, Chris Murphy wrote:
> Just wiping the slate clean to summarize:
>
>
> 1. We have a consistent ~1 in 3 maybe 1 in 2, reproducible corruption
> of *data extent* parity during a scrub with raid5. Goffredo and I have
> both reproduced it. It's a big bug. It might still be useful if
> someone else can reproduce it too.
>
> Goffredo, can you file a bug at bugzilla.kernel.org and reference your
> bug thread? I don't know if the key developers know about this, it
> might be worth pinging them on IRC once the bug is filed.
>
> Unknown if it affects balance, or raid 6. And if it affects raid 6, is
> p or q corrupted, or both? Unknown how this manifests on metadata
> raid5 profile (only tested was data raid5). Presumably if there is
> metadata corruption that's fixed during a scrub, and its parity is
> overwritten with corrupt parity, the next time there's a degraded
> state, the file system would face plant somehow. And we've seen quite
> a few degraded raid5's (and even 6's) face plant in inexplicable ways
> and we just kinda go, shit. Which is what the fs is doing when it
> encounters a pile of csum errors. It treats the csum errors as a
> signal to disregard the fs rather than maybe only being suspicious of
> the fs. Could it turn out that these file systems were recoverable,
> just that Btrfs wasn't tolerating any csum error and wouldn't proceed
> further?

I believe the same is true for RAID6, based on my experience. I actually
wondered whether the system halts were caused by the sheer volume of csum
errors rather than by the errors themselves. Just about every hang, where
all cores went to 100% CPU usage and the system simply stopped, came
after a flood of csum errors. If there were only one or two (or I copied
data off via a network connection where the read rate was slower), I
found I had a MUCH lower chance of the system locking up.
In fact, now that I think about it, when I was copying data to an
external USB drive (maxed out at ~30MB/sec), I still got csum errors,
but the system never hung.

Every crash ended with a last line along the lines of "Stopped recurring
error. Your system needs rebooting". I wonder whether the system would
stay up if this error reporting were altered. Of course, I have no way
of testing this.

-- 
Steven Haigh

Email: netwiz@xxxxxxxxx
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
