Re: Status of RAID5/6

On Sat, Mar 31, 2018 at 12:57 AM, Goffredo Baroncelli
<kreijack@xxxxxxxxx> wrote:
> On 03/31/2018 07:03 AM, Zygo Blaxell wrote:
>>>> btrfs has no optimization like mdadm write-intent bitmaps; recovery
>>>> is always a full-device operation.  In theory btrfs could track
>>>> modifications at the chunk level but this isn't even specified in the
>>>> on-disk format, much less implemented.
>>> It could go even further; it would be sufficient to track which
>>> *partial* stripes update will be performed before a commit, in one
>>> of the btrfs logs. Then in case of a mount of an unclean filesystem,
>>> a scrub on these stripes would be sufficient.
>
>> A scrub cannot fix a raid56 write hole--the data is already lost.
>> The damaged stripe updates must be replayed from the log.
>
> Your statement is correct, but you don't consider the COW nature of btrfs.
>
> The key is that if a data write is interrupted, the whole transaction is interrupted and aborted. And due to the COW nature of btrfs, the "old state" is restored at the next reboot.
>
> What is needed in any case is a rebuild of parity to avoid the "write-hole" bug.

The write hole does happen on disk in Btrfs, but the ensuing corruption
on rebuild is detected by checksum. Corrupt data never propagates. The
problem is that Btrfs gives up once the corruption is detected.

If it assumes just a bit flip (not always a correct assumption, but
probably reasonable most of the time), it could iterate very quickly:
flip a bit, recompute the checksum, and compare. It doesn't have to
iterate across 64KiB times the number of devices. It really only has
to iterate bit flips on the particular 4KiB block that has failed csum
(or in the case of metadata, 16KiB for the default leaf size, up to a
max of 64KiB).

For a 4KiB data block that's a maximum of 32768 iterations and
comparisons (4096 bytes times 8 bits). It'd be quite fast. And going
for two bit flips, while a lot slower (roughly 537 million
combinations), is probably not all that bad either.
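
To make the single-bit-flip search concrete, here is a rough sketch in
Python. It is only an illustration of the idea, not anything that
exists in btrfs: the function names are made up, the checksum is plain
standard CRC-32C (Castagnoli), and the exact way btrfs stores its csum
on disk is not modeled.

```python
# Illustrative sketch only: names are hypothetical, not btrfs code.

def _make_table():
    """Build the byte-wise lookup table for reflected CRC-32C."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            # 0x82F63B78 is the reflected Castagnoli polynomial.
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data):
    """Standard CRC-32C: initial value and final XOR of 0xFFFFFFFF."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def repair_single_bit_flip(block, expected_csum):
    """Try every single-bit flip in block until the checksum matches.

    Returns the repaired bytes, the original bytes if the checksum
    already matches, or None if no single-bit flip produces a match.
    """
    buf = bytearray(block)
    if crc32c(buf) == expected_csum:
        return bytes(buf)
    for bit in range(len(buf) * 8):
        byte_index, mask = bit // 8, 1 << (bit % 8)
        buf[byte_index] ^= mask          # flip one bit
        if crc32c(buf) == expected_csum:
            return bytes(buf)            # checksum matches again
        buf[byte_index] ^= mask          # undo and try the next bit
    return None
```

Calling repair_single_bit_flip(damaged_block, stored_csum) on a 4KiB
block walks at most 32768 candidates. Pure Python is slow at this, so
treat it as a description of the loop rather than a performance
estimate.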

Now if it's the kind of corruption you get from a torn or misdirected
write, there's enough corruption that you're effectively trying to
find a collision on crc32c with a partial match as a guide. That'd
take a while, and you might end up with corrupted data anyway, since
a 32-bit crc32c isn't cryptographically secure and a wrong candidate
can match the checksum.


-- 
Chris Murphy