On Tue, Apr 7, 2020 at 3:10 AM Zygo Blaxell <ce3g8jdj@xxxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Mon, Apr 06, 2020 at 05:25:04PM +0100, Filipe Manana wrote:
> > On Mon, Apr 6, 2020 at 1:13 PM Anand Jain <anand.jain@xxxxxxxxxx> wrote:
> > >
> > > > 7) When we do the actual write of this stripe, because it's a partial
> > > > stripe write (we aren't writing to all the pages of all the stripes of
> > > > the full stripe), we need to read the remaining pages of stripe 2 (page
> > > > indexes from 4 to 15) and all the pages of stripe 1 from disk in order
> > > > to compute the content for the parity stripe. So we submit bios to read
> > > > those pages from the corresponding devices (we do this at
> > > > raid56.c:raid56_rmw_stripe()).
> > > >
> > > > The problem is that we
> > > > assume whatever we read from the devices is valid -
> > >
> > > Any idea why we have to assume here, shouldn't the csum / parent
> > > transid verification fail at this stage?
> >
> > I think we're not on the same page. Have you read the whole e-mail?
> >
> > At that stage, or any other stage during a partial stripe write,
> > there's no verification, that's the problem.
> > The raid5/6 layer has no information about which other parts of a
> > stripe may be allocated to an extent (which can be either metadata or
> > data).
> >
> > Getting such information and then doing the checks is expensive and
> > complex.
>
> ...and yet maybe still worth doing?  "Make it correct, then make it fast."

Yes... it's not like I started fixing things on btrfs just today :)
However, I like to have at least an idea of how to make things perform
better whenever I know or suspect a change will have a significant
performance impact.

>
> I did see Qu's proposal to use a parity mismatch detection to decide when
> to pull out the heavy scrub gun.  It's clever, but I see that more as an
> optimization than the right way forward:  the correct behavior is *always*
> to do the csum verification when reading a block, even if it's because we
> are writing some adjacent block that is involved in a parity calculation.

Scrub does checksum verification for both metadata and data, so it
should be reliable.

>
> We can offer "skip the csum check when the data and parity matches"
> as an optional but slightly unsafe speedup.  If you are unlucky with
> single-bit errors on multiple drives (e.g. they all have a firmware bug
> that makes them all zero out the 54th byte in some sector) then you might
> end up with a stripe that has matching parity but invalid SHA256 csums,
> and maybe that becomes a security hole that only applies to btrfs raid5/6.
> On the other hand, maybe you're expecting all of your errors to be random
> and uncorrelated, and parity mismatch detection is good enough.
>
> If we adopt the "do csum verification except when we know we can avoid
> it" approach, there are other options where we know we can avoid the
> extra verification.  We could have the allocator try to allocate full
> RAID stripes so that we can skip the stripe-wide csum check because we
> know we are writing all the data in the stripe.  Since we would have the
> proper csum check to fall back on for correctness, the allocator can fall
> back to partial RAID stripes when we run out of full ones, and we don't
> get some of the worst parts of the "allocate only full stripes" approach.
> We do still have write hole with any partial stripe updates, but that's
> probably best solved separately.
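To make the trade-off above a bit more concrete, below is a rough
userspace sketch - toy code, not the real raid56.c; the geometry
constants (STRIPE_LEN, NR_DATA) and verify_block_csum() are made-up
stand-ins for the real stripe layout and the csum tree / extent buffer
checks - of what a "trust the blocks if parity matches, otherwise do
per-block csum verification" gate could look like during an RMW:

/*
 * Toy userspace sketch - NOT the real raid56.c code. It only illustrates
 * the "check parity first, fall back to csum verification on mismatch"
 * idea, for a 2-data-stripe raid5 layout.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define STRIPE_LEN 64   /* made-up stripe length for the demo */
#define NR_DATA    2    /* data stripes in one full raid5 stripe */

/* Stand-in for the real csum/transid verification; always passes here. */
static bool verify_block_csum(const uint8_t *block, size_t len)
{
        (void)block; (void)len;
        return true;
}

/* Recompute raid5 parity as the XOR of all data stripes. */
static void xor_parity(uint8_t *parity, uint8_t data[NR_DATA][STRIPE_LEN])
{
        memset(parity, 0, STRIPE_LEN);
        for (int d = 0; d < NR_DATA; d++)
                for (size_t i = 0; i < STRIPE_LEN; i++)
                        parity[i] ^= data[d][i];
}

/*
 * Decide whether the blocks read for an RMW can be trusted when computing
 * the new parity: fast path when the on-disk parity matches the data we
 * read (the optional, slightly unsafe speedup), slow path doing per-block
 * csum checks.
 */
static bool rmw_blocks_trustworthy(uint8_t data[NR_DATA][STRIPE_LEN],
                                   const uint8_t *ondisk_parity,
                                   bool allow_parity_shortcut)
{
        uint8_t computed[STRIPE_LEN];

        xor_parity(computed, data);
        if (allow_parity_shortcut &&
            memcmp(computed, ondisk_parity, STRIPE_LEN) == 0)
                return true;

        for (int d = 0; d < NR_DATA; d++)
                if (!verify_block_csum(data[d], STRIPE_LEN))
                        return false;
        return true;
}

int main(void)
{
        uint8_t data[NR_DATA][STRIPE_LEN] = { { 0xAA }, { 0x55 } };
        uint8_t parity[STRIPE_LEN];

        xor_parity(parity, data);
        printf("trustworthy: %d\n",
               rmw_blocks_trustworthy(data, parity, true));
        return 0;
}

The allow_parity_shortcut flag is exactly the optional, slightly unsafe
speedup described above: correlated corruption on multiple devices can
still produce matching parity, so the csum path would have to remain the
default for correctness.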
> > We do validate an extent from a higher level (not in
> > raid56.c) when we read the extent (at btree_readpage_end_io_hook() and
> > btree_read_extent_buffer_pages()), and then if something is wrong with
> > it we attempt the recovery - in the case of raid56, by rebuilding the
> > stripe based on the remaining stripes. But if a write into another
> > extent of the same stripe happens before we attempt to read the
> > corrupt extent, we end up not being able to recover the extent, and
> > permanently destroy that possibility by overwriting the parity
> > stripe with content that was computed based on a corrupt extent.
> >
> > That's why I was asking for suggestions, because it's not trivial to
> > do it without having a significant impact on performance and
> > complexity.
> >
> > About why we don't do it, I suppose the original author of our raid5/6
> > implementation never thought it could lead to a permanent corruption.
> >
> > > There is a raid1 test case [1] which reproduces more consistently:
> > > [1] https://patchwork.kernel.org/patch/11475417/
> > > It looks like it's a result of not updating the generation for nocsum
> > > file data modifications.
> >
> > Sorry, I don't see the relation.
> > The problem I'm exposing is exclusive to raid5/6, it's about partial
> > stripe writes, and raid1 is not striped.
> > Plus it's not about nodatacow/nodatacsum either; it affects both cow
> > and nocow, and metadata as well.
> >
> > Thanks.
> >
> > > Thanks, Anand
> > >
> > > > in this case what we read from device 3, to which stripe 2 is mapped,
> > > > is invalid since in the degraded mount we haven't written extent
> > > > buffer 39043072 to it - so we get garbage from that device (either a
> > > > stale extent, a bunch of zeroes due to trim/discard or anything
> > > > completely random).
> > > >
> > > > Then we compute the content for the parity stripe based on that
> > > > invalid content we read from device 3 and write the parity stripe
> > > > (and the other two stripes) to disk;
> >
> > --
> > Filipe David Manana,
> >
> > “Whether you think you can, or you think you can't — you're right.”
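To make the failure mode in the quoted description concrete, here is a
toy single-byte demonstration (plain userspace C, not btrfs code; all
values are made up) of how recomputing parity from an unvalidated,
stale block during a partial stripe write permanently destroys the
ability to rebuild that block later:

/*
 * Toy demonstration (plain userspace C, single-byte "stripes") of why a
 * partial stripe write on a degraded raid5 permanently destroys the
 * chance of rebuilding the missing block.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint8_t d1 = 0xAA, d2 = 0x55;       /* data stripes 1 and 2 */
        uint8_t p  = d1 ^ d2;               /* parity written while healthy */

        /* Device holding d2 was missing, so what's on it now is garbage. */
        uint8_t garbage = 0x00;

        /* Before any new write, d2 is still recoverable from d1 and p. */
        assert((d1 ^ p) == d2);

        /* Partial stripe write: update d1; the RMW reads "d2" (garbage)
         * and recomputes parity from it without any validation. */
        uint8_t new_d1 = 0x3C;
        uint8_t new_p  = new_d1 ^ garbage;

        /* Later, reading d2 fails csum/transid checks and we try to
         * rebuild it from the other stripes - but only the garbage is
         * left to reconstruct. */
        uint8_t rebuilt_d2 = new_d1 ^ new_p;
        printf("expected 0x%02x, rebuilt 0x%02x\n", d2, rebuilt_d2);
        assert(rebuilt_d2 != d2);           /* recovery is gone for good */
        return 0;
}

Running it prints "expected 0x55, rebuilt 0x00": once the new parity has
been computed from the garbage, rebuilding the missing block just hands
the garbage back, which is the permanent corruption described above.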
--
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”