On Mon, Apr 6, 2020 at 1:13 PM Anand Jain <anand.jain@xxxxxxxxxx> wrote:
>
> > 7) When we do the actual write of this stripe, because it's a partial
> > stripe write (we aren't writing to all the pages of all the stripes of
> > the full stripe), we need to read the remaining pages of stripe 2 (page
> > indexes from 4 to 15) and all the pages of stripe 1 from disk in order
> > to compute the content for the parity stripe. So we submit bios to read
> > those pages from the corresponding devices (we do this at
> > raid56.c:raid56_rmw_stripe()).
> >
> > The problem is that we assume whatever we read from the devices is
> > valid -
>
> Any idea why we have to assume here, shouldn't the csum / parent
> transid verification fail at this stage?

I think we're not on the same page. Have you read the whole e-mail?

At that stage, or any other stage during a partial stripe write, there's
no verification, that's the problem. The raid5/6 layer has no information
about which other parts of a stripe may be allocated to an extent (which
can be either metadata or data). Getting such information and then doing
the checks is expensive and complex.

We do validate an extent at a higher level (not in raid56.c) when we read
the extent (at btree_readpage_end_io_hook() and
btree_read_extent_buffer_pages()), and if something is wrong with it we
attempt recovery - in the case of raid56, by rebuilding the stripe based
on the remaining stripes.

But if a write into another extent of the same stripe happens before we
attempt to read the corrupt extent, we end up not being able to recover
the extent, and we permanently destroy that possibility by overwriting
the parity stripe with content that was computed based on a corrupt
extent.

That's why I was asking for suggestions, because it's not trivial to do
this without a significant impact on performance and complexity.
About why we don't do it, I suppose the original author of our raid5/6
implementation never thought it could lead to permanent corruption.

> There is a raid1 test case [1] which reproduces more consistently.
>
> [1] https://patchwork.kernel.org/patch/11475417/
>
> It looks like it's the result of avoiding updates to the generation for
> nocsum file data modifications.

Sorry, I don't see the relation. The problem I'm exposing is exclusive to
raid5/6, it's about partial stripe writes, and raid1 is not striped.
Plus it's not about nodatacow/nodatacsum either, it affects both cow and
nocow, and metadata as well.

Thanks.

> Thanks, Anand
>
> > in this case what we read from device 3, to which stripe 2 is mapped,
> > is invalid since in the degraded mount we haven't written extent
> > buffer 39043072 to it - so we get garbage from that device (either a
> > stale extent, a bunch of zeroes due to trim/discard or anything
> > completely random).
> >
> > Then we compute the content for the parity stripe based on that
> > invalid content we read from device 3 and write the parity stripe
> > (and the other two stripes) to disk;

-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”
