On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn
<ahferroin7@xxxxxxxxx> wrote:
> On 2016-06-25 12:44, Chris Murphy wrote:
>> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
>> <ahferroin7@xxxxxxxxx> wrote:
>>
>>> Well, the obvious major advantage that comes to mind for me for
>>> checksumming parity is that it would let us scrub the parity data
>>> itself and verify it.
>>
>> OK, but hold on. During scrub, it should read data, compute checksums
>> *and* parity, and compare those to what's on-disk -> EXTENT_CSUM in
>> the checksum tree, and the parity strip in the chunk tree. And if
>> parity is wrong, then it should be replaced.
>
> Except that's horribly inefficient. With limited exceptions involving
> highly situational co-processors, computing a checksum of a parity
> block is always going to be faster than computing parity for the
> stripe. By using that to check parity, we can safely speed up the
> common case of near-zero errors during a scrub by a pretty
> significant factor.

OK, I'm in favor of that. Although somehow md gets away with computing
and checking parity for its scrubs, and still manages to keep drives
saturated in the process - at least with HDDs; I'm not sure how it
fares on SSDs.

> The ideal situation that I'd like to see for scrub WRT parity is:
> 1. Store checksums for the parity itself.
> 2. During scrub, if the checksum is good, the parity is good, and we
> just saved the time of computing the whole parity block.
> 3. If the checksum is not good, then compute the parity. If the
> parity just computed matches what is there already, the checksum is
> bad and should be rewritten (and we should probably recompute the
> whole block of checksums it's in); otherwise, the parity was bad:
> write out the new parity and update the checksum.
> 4. Have an option to skip the csum check on the parity and always
> compute it.

>> Even 'check > md/sync_action' does this.
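For what it's worth, the decision flow in steps 1-4 above can be
sketched in a few lines of Python. This is purely illustrative: the
function names are made up, plain crc32 stands in for btrfs's crc32c,
and real RAID56 scrub would operate on on-disk stripes, not byte
strings:

```python
import zlib

def csum(data: bytes) -> int:
    # Stand-in for btrfs's crc32c; plain crc32 for illustration only.
    return zlib.crc32(data)

def xor_parity(stripes):
    # RAID5 parity is the byte-wise XOR of the data stripes.
    out = bytearray(len(stripes[0]))
    for stripe in stripes:
        for i, b in enumerate(stripe):
            out[i] ^= b
    return bytes(out)

def scrub_parity(data_stripes, parity_on_disk, parity_csum_on_disk,
                 skip_csum_check=False):
    # Step 2: stored checksum matches -> trust the parity and skip
    # recomputing it (the common, near-zero-error case).
    if not skip_csum_check and csum(parity_on_disk) == parity_csum_on_disk:
        return "ok"
    # Step 3 (or step 4's always-compute option): compute real parity.
    computed = xor_parity(data_stripes)
    if computed == parity_on_disk:
        # Parity was actually fine; only the stored checksum was bad.
        if csum(parity_on_disk) == parity_csum_on_disk:
            return "ok"
        return "rewrite_csum"
    # Parity itself was bad: write new parity and update its checksum.
    return "rewrite_parity"
```

The point of step 2 shows up clearly: the expensive `xor_parity()` call
is only reached when the cheap checksum comparison fails or is
explicitly skipped.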
>> So no pun intended, but Btrfs isn't even at parity with mdadm on
>> data integrity if it doesn't check whether the parity matches the
>> data.
>
> Except that MD and LVM don't have checksums to verify anything
> outside of the very high-level metadata. They have to compute the
> parity during a scrub because that's the _only_ way they have to
> check data integrity. Just because that's the only way for them to
> check it does not mean we have to follow their design, especially
> considering that we have other, faster ways to check it.

I'm not opposed to this optimization. But to better qualify my earlier
"major advantage" remark: what I meant was an advantage in terms of
solving a functional deficiency.

>> The much bigger problem we have right now, affecting Btrfs and
>> LVM/mdadm md raid alike, is the silly bad default of non-enterprise
>> drives having no configurable SCT ERC, with the ensuing long
>> recovery times, and the kernel SCSI command timer at 30 seconds -
>> which also hurts regular single-disk users, because it means they
>> don't get the "benefit" of long recovery times, which is the whole
>> point of that feature. This causes many problems where bad sectors
>> just get worse and never get fixed up, because of all the link
>> resets. So I still think it's a bad default on the kernel side: it
>> affects the majority use case, and is only a non-problem with
>> proprietary hardware RAID, or with software RAID on enterprise (or
>> NAS-specific) drives that already have short recovery times by
>> default.
>
> On this, we can agree.

It just came up again in a thread over the weekend on linux-raid@. I'm
going to ask, while people are paying attention, whether a patch to
raise the 30-second timeout to something much higher has ever been
floated, what the negatives might be, and where to get this fixed if
it wouldn't be accepted in the kernel code directly.

*Ideally* I think we'd want two timeouts.
I'd like to see commands have a timer that results in merely a
warning, which could be used by e.g. btrfs scrub to know "hey, this
sector range is slow; I'm going to write over those sectors." That's
how bad sectors start out: they read slower and slower, until they
eventually go beyond 30 seconds and then it's all link resets. If the
problem could be fixed before that point, that's the best scenario.
The second timer would be for "OK, the controller or drive just
face-planted - reset."

-- 
Chris Murphy
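P.S. The workaround usually circulated on linux-raid@ for the
mismatch described above pairs the two knobs in question: cap the
drive's error recovery with SCT ERC when the drive supports it, or
raise the kernel's command timer when it doesn't. `/dev/sda` and the
timeout value below are placeholders; adjust per drive:

```shell
# Check whether the drive supports SCT ERC and what it is set to
smartctl -l scterc /dev/sda

# If supported: cap error recovery at 7.0 seconds (units of 0.1 s),
# safely under the kernel's default 30-second SCSI command timer
smartctl -l scterc,70,70 /dev/sda

# If the drive does NOT support SCT ERC: raise the kernel's command
# timer instead, so the drive can finish its possibly minutes-long
# internal recovery without triggering a link reset
echo 180 > /sys/block/sda/device/timeout
```

Note that neither setting is persistent: SCT ERC typically resets on a
drive power cycle, and the sysfs timeout resets on reboot, so these
are normally applied from a udev rule or boot script.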
