Re: RAID5/6 permanent corruption of metadata and data extents

On Mon, Apr 6, 2020 at 1:13 PM Anand Jain <anand.jain@xxxxxxxxxx> wrote:
>
>
>
> > 7) When we do the actual write of this stripe, because it's a partial
> >     stripe write (we aren't writing to all the pages of all the stripes
> >     of the full stripe), we need to read the remaining pages of stripe 2
> >     (page indexes from 4 to 15) and all the pages of stripe 1 from disk
> >     in order to compute the content for the parity stripe. So we submit
> >     bios to read those pages from the corresponding devices (we do this
> >     at raid56.c:raid56_rmw_stripe()).
>
>
> > The problem is that we
> >     assume whatever we read from the devices is valid -
>
>    Any idea why we have to assume here - shouldn't the csum / parent
>    transid verification fail at this stage?

I think we're not on the same page. Have you read the whole e-mail?

At that stage, or at any other stage during a partial stripe write,
there is no verification - that's the problem.
The raid5/6 layer has no information about which other parts of a
stripe may be allocated to an extent (which can be either metadata or
data).

Getting such information and then doing the checks is expensive and
complex. We do validate an extent from a higher level (not in
raid56.c) when we read the extent (at btree_readpage_end_io_hook() and
btree_read_extent_buffer_pages()), and then if something is wrong with
it we attempt the recovery - in the case of raid56, by rebuilding the
stripe based on the remaining stripes. But if a write into another
extent of the same stripe happens before we attempt to read the
corrupt extent, we end up not being able to recover the extent - we
permanently destroy that possibility by overwriting the parity stripe
with content that was computed from a corrupt extent.

That's why I was asking for suggestions, because it's not trivial to
do this without a significant impact on performance and complexity.

As for why we don't do it, I suppose the original author of our raid5/6
implementation never considered that it could lead to permanent
corruption.

>
>    There is a raid1 test case [1] which reproduces this more consistently.
>      [1] https://patchwork.kernel.org/patch/11475417/
>    It looks like it's a result of avoiding generation updates for nocsum
>    file data modifications.

Sorry, I don't see the relation.
The problem I'm describing is exclusive to raid5/6 - it's about
partial stripe writes, and raid1 is not striped.
It's not about nodatacow/nodatacsum either: it affects both cow and
nocow, and metadata as well.

Thanks.

>
> Thanks, Anand
>
>
> > in this case what we read
> >     from device 3, to which stripe 2 is mapped, is invalid since in the
> >     degraded mount we haven't written extent buffer 39043072 to it - so
> >     we get garbage from that device (either a stale extent, a bunch of
> >     zeroes due to trim/discard or anything completely random).
> >
> > Then we compute the content for the parity stripe based on that
> >     invalid content we read from device 3 and write the parity stripe
> >     (and the other two stripes) to disk;
>
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”



