On 04.04.2020 16:58 Zygo Blaxell wrote:
> On Fri, Apr 03, 2020 at 09:20:22AM +0200, Andrea Gelmini wrote:
>> On Thu, Apr 2, 2020 at 23:23 Zygo Blaxell
>> <ce3g8jdj@xxxxxxxxxxxxxxxxxxxxx> wrote:
>>> mdadm raid5/6 has no protection against the kinds of silent data
>>> corruption that btrfs can detect. If the drive has a write error and
>>> reports it to the host, mdadm will eject the entire disk from the array,
>>> and a resync is required to put it back into the array (correcting the
>>> error in the process). If the drive silently drops a write or the data
>> That's not true.
>> mdadm has a lot of retry/wait logic and different "reactions" depending
>> on what is happening.
>> You can have spare blocks to use just in case, to avoid kicking out the
>> entire drive over a single bad block.
> None of that helps. Well, OK, it would have prevented Filipe's specific
> test case from corrupting data in the specific way it did, but that test
> setup is overly complicated for this bug. 'cat /dev/urandom > /dev/sda'
> is a much clearer test setup that avoids having people conflate Filipe's
> bug with distracting and _totally unrelated_ bugs like the raid 5/6 write
> hole and a bunch of missing mdadm features.
>
> mdadm has no protection against silent data corruption in lower
> levels of the storage stack. mdadm relies on the lower-level device
> to indicate errors in data integrity. If you run mdadm on top of
> multiple dm-integrity devices in journal mode (double all writes!),
> then dm-integrity transforms silent data corruption into EIO errors,
> and mdadm can handle everything properly after that.
>
> Without dm-integrity (or equivalent) underneath mdadm, if one of the
> lower-level devices corrupts data, mdadm can't tell which version of the
> data is correct, and propagates that corruption to mirror and parity
> devices. The only way to recover is to somehow know which devices
> are corrupted (difficult because mdadm can't tell you which device,
> and even has problems telling you that _any_ device is corrupted) and
> force those devices to be resynced (which is usually a full-device sync,
> unless you have some way to know where the corruption is). And you have
> to do all that manually, before mdadm writes anywhere _near_ the data
> you want to keep.
>
> btrfs has integrity checks built in, so in the event of a data corruption,
> btrfs can decide whether the data or parity/mirror blocks are correct,
> and btrfs can avoid propagating corruption between devices (*). The bug
> in this case is that btrfs is currently not doing the extra checks
> needed for raid5/6, so we currently get mdadm-style propagation of data
> corruption to parity blocks. Later, btrfs detects the data csum failure,
> but by then parity has been corrupted and it is too late to recover.
>
> (*) except nodatasum files, which can be no better than mdadm, and are
> currently substantially worse in btrfs. These files are where the missing
> pieces of mdadm in btrfs are most obvious. But that's a separate issue
> that is also _totally unrelated_ to the bug(s) Filipe and I found, since
> all the data we are trying to recover has csums and can be recovered
> without any of the mdadm device-state-tracking stuff.

It would be interesting to see if and how the Synology btrfs+mdraid
self-healing hack handles this (here is someone trying to corrupt such
a setup: https://daltondur.st/syno_btrfs_1/). I still could not find
the source code, though.
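For anyone who wants to try the dm-integrity layering Zygo describes
above, a minimal sketch could look like the following (device names are
placeholders; dm-integrity's default journal mode is assumed):

  # Format each member disk with dm-integrity and open it as a mapped
  # device (journal mode is the default, which is what doubles the writes).
  integritysetup format /dev/sdb
  integritysetup open /dev/sdb integ-b
  # ... repeat for the remaining member disks ...

  # Assemble the RAID5 array on top of the integrity-protected mappings.
  # Silent corruption on a member now comes back as EIO, which mdadm
  # handles like any other read error instead of passing it through.
  mdadm --create /dev/md0 --level=5 --raid-devices=3 \
        /dev/mapper/integ-b /dev/mapper/integ-c /dev/mapper/integ-d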
>
>> It has had a write journal log to avoid the RAID5/6 write hole for
>> years (but people keep saying there's no way to avoid it on mdadm...)
> Yeah, if btrfs is going to copy the broken parts of mdadm, it should
> also copy the fixes...
>
>> Also, the greatest thing to me: Neil Brown did an incredible job of
>> constantly (year by year) improving the logic of mdadm (tools and
>> kernel) to make it more robust against user mistakes, and
>> protective/proactive about failing setups/operations emerging from
>> reports on the mailing list.
>>
>> For as long as I have read the mdadm mailing list, the flow was: a user
>> complains about a software/hardware problem, and after a while Neil
>> commits a fix to avoid the same problem in the future.
> mdadm does one thing very well, but only the one thing. I don't imagine
> Neil would extend mdadm to the point where it can handle silent
> data corruption on cheap SSDs or work around severe firmware bugs in
> write caching. That sounds more like a feature I'd expect to come out
> of VDO or bcachefs work.
>
>> A very constructive and useful way to manage the project.
>>
>> A few times I was saved by the tools warning: "you're doing a stupid
>> thing that could lose your data. But if you are sure, you can use
>> --force".
>> Or the kernel complaining: "I'm not going to assemble this. Use
>> --force if you're sure".
>>
>> On BTRFS, Qu is doing the same great job. Lots of patches to address
>> users' problems.
>>
>> Kudos to Qu!
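(On the write journal log Andrea mentions above: as far as I understand
it is just an extra device passed to mdadm at create time, roughly like
this, again with placeholder device names:

  # RAID5 with a dedicated write-journal device (typically an SSD);
  # stripe updates are journalled there first, closing the write hole.
  mdadm --create /dev/md0 --level=5 --raid-devices=3 \
        /dev/sdb /dev/sdc /dev/sdd --write-journal /dev/nvme0n1
)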
