Re: RAID5/6 permanent corruption of metadata and data extents

On Fri, Apr 3, 2020 at 1:16 PM Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote:
>
> OK, attempt number 2.
>
> Now this time, without zero writing.
>
> The workflow would look like: (raid5 example)
> - Partial stripe write triggered
>
> - Raid56 reads out all data stripes and the parity stripe
>   So far, the same routine as my previous proposal.
>
> - Re-calculate parity and compare.
>   If it matches, the full stripe is fine; continue the partial stripe
>   update routine.
>
>   If it doesn't match, block any further writes to the full stripe and
>   inform the upper layer to start a scrub on the logical range of the
>   full stripe. Wait for that scrub to finish, then continue the partial
>   stripe update.
>
>   ^^^ This is the most complex part AFAIK.

Since scrub figures out which extents are allocated in a stripe and then
does checksum verification (for both metadata and data), it triggers the
same recovery path (raid56_parity_recover()) as the regular read path when
validation fails (and that validation includes checking the fsid, chunk
tree uuid and bytenr on metadata extents). So scrub should work out fine.
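To make that check concrete, here is a minimal userspace sketch (plain C,
not the kernel code path) of the verification step you describe: XOR the
data stripes of a RAID5 full stripe and compare the result with the
on-disk parity, treating a mismatch as the trigger to scrub the full
stripe's logical range before letting the partial write proceed. The
stripe length, disk count and names below are just assumptions for the
illustration, not btrfs internals:

/* build: cc -std=c99 -o parity_check parity_check.c */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define STRIPE_LEN 16   /* toy stripe length; real stripes are much larger */
#define NR_DATA    2    /* 3-disk raid5: two data stripes plus one parity  */

/* RAID5 parity is the byte-wise XOR of all data stripes. */
static bool parity_matches(unsigned char data[NR_DATA][STRIPE_LEN],
                           const unsigned char *parity_on_disk)
{
        unsigned char computed[STRIPE_LEN] = { 0 };

        for (int i = 0; i < NR_DATA; i++)
                for (int off = 0; off < STRIPE_LEN; off++)
                        computed[off] ^= data[i][off];

        return memcmp(computed, parity_on_disk, STRIPE_LEN) == 0;
}

int main(void)
{
        unsigned char data[NR_DATA][STRIPE_LEN] = {
                "data stripe 0..", "data stripe 1.."
        };
        unsigned char parity[STRIPE_LEN];

        /* Start with a correct parity stripe. */
        for (int off = 0; off < STRIPE_LEN; off++)
                parity[off] = data[0][off] ^ data[1][off];
        printf("clean stripe:     %s\n",
               parity_matches(data, parity) ? "parity ok" : "scrub needed");

        /* Flip one byte in a data stripe to model silent corruption. */
        data[1][3] ^= 0xff;
        printf("corrupted stripe: %s\n",
               parity_matches(data, parity) ? "parity ok" : "scrub needed");
        return 0;
}

For raid6 the same idea applies, just with both P and Q recomputed and
compared instead of the single XOR parity.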

My initial idea was actually only for the case where a previously missing
device is made available again and the fs is then mounted in non-degraded
mode. Every time a partial write was attempted, if we detected a device
with a generation (taken from the device's superblock when the fs is
mounted) lower than the generation of the other devices, we would know it
was a device that had been missing, and trigger recovery through the same
API used by the validation path (raid56_parity_recover()), which just
reads the stripes from the other devices and reconstructs the one for the
bad (previously missing) device. We have a stripe cache that keeps
stripes around for a while, so it wouldn't be too expensive all the time.
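For what it's worth, a hedged sketch of that detection in plain C rather
than the actual btrfs structures (the struct and field names are made up
for the illustration): walk the devices backing the full stripe, compare
each one's superblock generation, captured at mount time, against the
newest generation seen, and report the stale device so the caller can
rebuild its stripe through raid56_parity_recover():

#include <stdint.h>

/* Illustrative only: not the kernel's struct btrfs_device. */
struct stripe_device {
        uint64_t generation;   /* from the device's superblock at mount */
};

/*
 * Return the index of a device whose generation lags behind the newest
 * one (i.e. a previously missing device), or -1 if all are current.
 */
static int find_stale_device(const struct stripe_device *devs, int nr_devs,
                             uint64_t newest_generation)
{
        for (int i = 0; i < nr_devs; i++) {
                if (devs[i].generation < newest_generation)
                        return i;   /* rebuild this one via recovery */
        }
        return -1;
}

The check itself would be cheap, since the generations are already in
memory after mount; the cost is all in the rebuild of the stale stripe.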

The problem was the case where we never mounted in degraded mode and a
stripe just got corrupted somehow; figuring out which stripe (or stripes)
is bad is simply not possible there. So it wasn't a complete solution.

Doing as you suggest is a much better approach than anything suggested
before. Sometimes it will even be less expensive, when the stripes are
already in the cache due to a previous write, leaving only the parity
calculation and comparison to do.

Running the scrub for the full stripe's range might indeed work. In my
previous tests (a modified btrfs/125), running a full scrub right after
the mount and before doing anything else seemed to fix the problem.

>
>
> For full stripe update, we just update without any verification.
>
> Besides the complexity of the repair routine, another problem is when we
> do a partial stripe update on an untouched range.
>
> In that case, we will trigger a scrub for every new full stripe, and
> heavily degrade performance.
>
> Ideas on this crazy idea number 2?

It's a much less crazy idea.

It brings some complications, however. If a partial write is triggered
through a transaction commit and we need to run the scrub on the full
stripe, we need to take special care not to deadlock, since the
transaction commit pauses scrub. We would also need to deal with the case
of an already running scrub; that would likely be easier, but slow
(request it to pause and unpause it after the repair).
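To illustrate the ordering hazard with a toy model (plain C, invented
function names, nothing taken from the real code): the commit pauses
scrub, and if the partial-write repair running inside that commit then
waits for a scrub of the full stripe, the two end up waiting on each
other:

#include <stdbool.h>

/* Invented stubs standing in for the real work, so the sketch compiles. */
static bool parity_matches_full_stripe(void) { return false; }
static void wait_for_full_stripe_scrub(void) { /* would block here */ }
static void do_partial_stripe_write(void)    { }

static int scrub_paused;    /* set by the commit path */

static void partial_stripe_write(void)
{
        if (!parity_matches_full_stripe()) {
                /*
                 * Hazard: if we got here from transaction_commit() below,
                 * the scrub we wait for cannot run, because our own caller
                 * keeps scrub paused until the commit finishes.
                 */
                wait_for_full_stripe_scrub();
        }
        do_partial_stripe_write();
}

static void transaction_commit(void)
{
        scrub_paused = 1;           /* the commit pauses scrub           */
        partial_stripe_write();     /* metadata writeback may land here  */
        scrub_paused = 0;           /* scrub resumes only after commit   */
}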

Thanks

>
> Thanks,
> Qu
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”



