Re: RAID5/6 permanent corruption of metadata and data extents

On 04.04.2020 16:58 Zygo Blaxell wrote:
> On Fri, Apr 03, 2020 at 09:20:22AM +0200, Andrea Gelmini wrote:
>> On Thu, Apr 2, 2020 at 11:23 PM Zygo Blaxell
>> <ce3g8jdj@xxxxxxxxxxxxxxxxxxxxx> wrote:
>>> mdadm raid5/6 has no protection against the kinds of silent data
>>> corruption that btrfs can detect.  If the drive has a write error and
>>> reports it to the host, mdadm will eject the entire disk from the array,
>>> and a resync is required to put it back into the array (correcting the
>>> error in the process).  If the drive silently drops a write or the data
>> That's not true.
>> mdadm has a lot of retry/wait logic and different "reactions" depending
>> on what is happening.
>> You can have spare blocks to use just in case, to avoid kicking the
>> entire drive out of the array
>> over a single bad block.
> None of that helps.  Well, OK, it would have prevented Filipe's specific
> test case from corrupting data in the specific way it did, but that test
> setup is overly complicated for this bug.  'cat /dev/urandom > /dev/sda'
> is a much clearer test setup that avoids having people conflate Filipe's
> bug with distracting and _totally unrelated_ bugs like the raid5/6 write
> hole and a bunch of missing mdadm features.
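
To make the quoted reproducer concrete, a rough sketch of what I
understand it to look like (device names, filesystem layout and mount
point are all assumptions, and this destroys everything on the scratch
disks):

    # three-device btrfs raid5 on scratch disks (hypothetical names)
    mkfs.btrfs -f -d raid5 -m raid1 /dev/sdb /dev/sdc /dev/sdd
    mount /dev/sdb /mnt
    # write some data and flush it to disk
    dd if=/dev/urandom of=/mnt/testfile bs=1M count=512
    sync
    # silently corrupt one member behind btrfs's back
    cat /dev/urandom > /dev/sdd
    # read back / scrub: csum failures should be repaired from the other
    # members, unless parity was already corrupted (the bug at hand)
    btrfs scrub start -B /mnt
    btrfs device stats /mnt
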
>
> mdadm has no protection against silent data corruption in lower
> levels of the storage stack.  mdadm relies on the lower level device
> to indicate errors in data integrity.  If you run mdadm on top of
> multiple dm-integrity devices in journal mode (double all writes!),
> then dm-integrity transforms silent data corruption into EIO errors,
> and mdadm can handle everything properly after that.
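
For reference, a sketch of that layering as I understand it (device
names are made up; standalone dm-integrity uses its journal by default,
which is where the doubled writes come from):

    # give each member its own dm-integrity layer
    integritysetup format /dev/sdb
    integritysetup format /dev/sdc
    integritysetup format /dev/sdd
    integritysetup open /dev/sdb int-sdb
    integritysetup open /dev/sdc int-sdc
    integritysetup open /dev/sdd int-sdd
    # build the md raid5 on top of the integrity devices
    mdadm --create /dev/md0 --level=5 --raid-devices=3 \
          /dev/mapper/int-sdb /dev/mapper/int-sdc /dev/mapper/int-sdd

With that stack, a corrupted sector comes back as EIO from dm-integrity
instead of silently wrong data, and md can reconstruct it from the
remaining members.
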
>
> Without dm-integrity (or equivalent) underneath mdadm, if one of the
> lower-level devices corrupts data, mdadm can't tell which version of the
> data is correct, and propagates that corruption to mirror and parity
> devices.  The only way to recover is to somehow know which devices
> are corrupted (difficult, because mdadm can't tell you which device it
> is, and even has problems telling you that _any_ device is corrupted) and
> force those devices to be resynced (which is usually a full-device sync,
> unless you have some way to know where the corruption is).  And you have
> to do all that manually, before mdadm writes anywhere _near_ the data
> you want to keep.
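
As far as I know, the closest mdadm gets on its own is the sysfs scrub
interface, roughly (array name assumed to be md0):

    # compare data blocks against parity/mirrors and count mismatches
    echo check > /sys/block/md0/md/sync_action
    cat /sys/block/md0/md/mismatch_cnt
    # "repair" regenerates parity/mirrors from the data blocks,
    # whether or not those data blocks are the corrupted side
    echo repair > /sys/block/md0/md/sync_action

which confirms the point above: it can count mismatches, but it cannot
say which member is wrong.
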
>
> btrfs has integrity checks built in, so in the event of a data corruption,
> btrfs can decide whether the data or parity/mirror blocks are correct,
> and btrfs can avoid propagating corruption between devices (*).  The bug
> in this case is that btrfs is currently not doing the extra checks
> needed for raid5/6, so we currently get mdadm-style propagation of data
> corruption to parity blocks.  Later, btrfs detects the data csum failure
> but by then parity has been corrupted and it is too late to recover.
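
On the btrfs side, those per-block csums are what scrub and the
per-device error counters report against, e.g. (mount point assumed to
be /mnt):

    # verify every data/metadata block against its csum; bad copies are
    # rewritten from a good copy where one exists
    btrfs scrub start -B -d /mnt
    # per-device write/read/flush/corruption/generation error counters
    btrfs device stats /mnt
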
>
> (*) except for nodatasum files, which can be no better than mdadm, and are
> currently substantially worse in btrfs.  These files are where the missing
> pieces of mdadm in btrfs are most obvious.  But that's a separate issue
> that is also _totally unrelated_ to the bug(s) Filipe and I found, since
> all the data we are trying to recover has csums and can be recovered
> without any of the mdadm device-state-tracking stuff.

It would be interesting to see whether/how the Synology btrfs+mdraid
self-healing hack handles this (here is someone trying to trigger this
kind of corruption: https://daltondur.st/syno_btrfs_1/). I still could
not find the source code, though.

>
>> It has had a write journal (log) to avoid the RAID5/6 write hole for
>> years now, but people keep
>> saying there's no way to avoid it on mdadm...
> Yeah, if btrfs is going to copy the broken parts of mdadm, it should
> also copy the fixes...
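
For completeness, that md journal is configured at array creation time
with a dedicated (ideally fast and power-loss-safe) journal device; a
sketch with made-up device names:

    # raid5 whose writes are staged through a journal device first,
    # closing the write hole at the cost of writing everything twice
    mdadm --create /dev/md0 --level=5 --raid-devices=4 \
          /dev/sdb /dev/sdc /dev/sdd /dev/sde \
          --write-journal /dev/nvme0n1p1
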
>
>> Also, the greatest thing to me: Neil Brown did an incredible job of
>> constantly (year by year)
>> improving the logic of mdadm (tools and kernel) to make it more robust against
>> user mistakes, and protective/proactive about failing setups/operations
>> reported on the
>> mailing list.
>>
>> Back when I read the mdadm mailing list, the flow was: a user complains
>> about a software/hardware problem,
>> and after a while Neil commits a change to avoid the same problem in the future.
> mdadm does one thing very well, but only the one thing.  I don't imagine
> Neil would extend mdadm to the point where it can handle silent
> data corruption on cheap SSDs or work around severe firmware bugs in
> write caching.  That sounds more like a feature I'd expect to come out
> of VDO or bcachefs work.
>
>> A very constructive and useful way to manage the project.
>>
>> A few times I was saved by the tools warning: "you're doing a stupid
>> thing that could lose your
>> data. But if you are sure, you can use --force".
>> Or the kernel complaining: "I'm not going to assemble this. Use
>> --force if you're sure".
>>
>> On BTRFS, Qu is doing the same great job. Lots of patches to address
>> users' problems.
>>
>> Kudos to Qu!




