On Fri, 19 Jun 2020 at 11:31, Roman Mamedov <rm@xxxxxxxxxxx> wrote:
>
> On Fri, 19 Jun 2020 10:08:43 +0200
> Daniel Smedegaard Buus <danielbuus@xxxxxxxxx> wrote:
>
> > Well, that's why I wrote having the *data* go bad, not the drive
>
> But data going bad wouldn't pass unnoticed like that (with reads resulting in
> bad data), since drives have end-to-end CRC checking, including on-disk and
> through the SATA interface. If data on-disk is somehow corrupted, that will be
> a CRC failure on read, and still an I/O error for the host.
>
> I've only heard of some bad SSDs (SiliconMotion-based) returning corrupted data
> as if nothing happened, and only when their flash lifespan is close to
> depletion.
>
> > even though either scenario should still effectively end up yielding the
> > same behavior from btrfs
>
> I believe that's also an assumption you'd want to test, if you want to be
> thorough in verifying its behavior on failures or corruptions. And anyway, it's
> better to set up a scenario that is as close as possible to one you'd get in
> real life.

All good and valid points, but only presupposing that each piece of hardware is behaving as advertised.

For instance, a few years back I discovered that a bug of some sort allowed my SiI PMP/SATA combo to randomly read or write data incorrectly, at a staggering rate, when running at SATA 2 speeds under Linux, with no I/O errors and thus no warnings anywhere. I was running a zpool on the disks attached to it, and ZFS silently kept retrying reads (and writes too, since it read back and verified written data), so I lost no data on that occasion, simply because I was using a data-checksumming filesystem. There's a record of me seeking help about it somewhere on the interwebs, probably in an Ubuntu forum, and I plugged the hole in the data destruction by forcing the controllers to run at SATA 1 speeds only.

At present, I have an old MacBook Pro that is occasionally experiencing rotted SSD blocks, silently as well.
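For what it's worth, the end-to-end CRC scheme described above can be sketched in a few lines: a checksum stored alongside the data turns a silently flipped bit into a hard error on read. This is only an illustrative toy (zlib's CRC32 standing in for a drive's per-sector CRC), not a claim about how any real firmware works:

```python
import zlib

def write_block(payload: bytes) -> bytes:
    # Store a CRC32 alongside the payload, roughly as drives do per sector.
    crc = zlib.crc32(payload).to_bytes(4, "big")
    return crc + payload

def read_block(block: bytes) -> bytes:
    # Verify the stored CRC before handing data to the caller; a mismatch
    # is surfaced as an error instead of silently returning bad data.
    stored, payload = block[:4], block[4:]
    if zlib.crc32(payload).to_bytes(4, "big") != stored:
        raise IOError("CRC mismatch: block is corrupt")
    return payload

block = write_block(b"important data")
assert read_block(block) == b"important data"

# Flip one bit in transit, as a buggy controller might:
corrupt = bytearray(block)
corrupt[7] ^= 0x01
try:
    read_block(bytes(corrupt))
except IOError:
    print("corruption detected")
```

The failure mode in the anecdotes below is precisely the case this scheme is supposed to rule out: data changing without any such error ever reaching the host.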
I've discovered it two or three times. Perhaps it's due to the machine having been dropped quite a few times, or because of what appeared to be a bit of humidity damage around the SSD socket (I was given it for free because it wouldn't recognize its SSD any longer, and thus wouldn't boot).

Also at present, I've found that the M.2 socket in my Ryzen rig on a B450 board will return garbage data, at least under multiple kernels, though perhaps not all, for reasons I'm guessing might be a buggy driver implementation, since I've had no issues with it under Windows. I've simply stopped accessing that drive under Linux, which isn't a problem, because the SSD on that controller is for my Windows gaming needs anyway.

And finally, again at present, I've seen silent data corruption on that same rig, with ZFS as the underlying FS, but my suspicion is that it's the result of overclocking the memory and stressing the system for very long stretches while producing par2 and rar files for my archiving needs.

My point is: yes, the drive and/or controller should tell me if what's being read back isn't what was once written, but my experience tells me never to actually rely on that being the case, lest I end up with bad, unrecoverable data (had I been running md raid instead of ZFS on that bad SiI rig, my entire data archive would have been severely, silently, and irrevocably damaged at that point in time). The fact that ZFS and btrfs both implement checksumming underlines the reality of that risk. Don't trust, check :)

To be fair, I'm not trying to "fix" any of the mentioned hardware issues with ZFS or btrfs here. I just pick a data-checksumming FS by default when I can. Right now I'm using ZFS on a scratch disk and getting fed up with its poor performance, so I'm looking to use btrfs instead, as my only need here is data checksumming, and AFAIR btrfs performs significantly better than ZFS.
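A scrub is essentially that "check" step applied to everything on disk: re-read each block and compare it against the checksum recorded at write time. Here is a toy sketch of the idea; the dict-based "disk" and the fs_write/scrub names are all hypothetical, just illustrating what `zpool scrub` or `btrfs scrub` do at scale:

```python
import hashlib

# Hypothetical in-memory "disk": block number -> data, plus a separate
# table of checksums recorded at write time (as ZFS/btrfs keep in metadata).
disk = {}
checksums = {}

def fs_write(blockno, data):
    disk[blockno] = data
    checksums[blockno] = hashlib.sha256(data).hexdigest()

def scrub():
    # Re-read every block and compare against the recorded checksum,
    # reporting the blocks that no longer match what was written.
    return [n for n, data in disk.items()
            if hashlib.sha256(data).hexdigest() != checksums[n]]

fs_write(0, b"archive.rar")
fs_write(1, b"parity.par2")

# Rot a block behind the filesystem's back, as bad hardware might:
disk[1] = b"parity.p@r2"
print(scrub())  # -> [1]
```

A plain read of block 1 would happily return the rotted bytes; only the recorded checksum reveals that they aren't what was written.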
That's why I was verifying that it does indeed have functional data checksumming :)

Cheers for the input!

Daniel :)

> With respect,
> Roman
