On Tue, Nov 29, 2016 at 02:52:47AM +0100, Christoph Anton Mitterer wrote:
> On Mon, 2016-11-28 at 16:48 -0500, Zygo Blaxell wrote:
> > If a drive's embedded controller RAM fails, you get corruption on the
> > majority of reads from a single disk, and most writes will be corrupted
> > (even if they were not before).
>
> Administrating a multi-PiB Tier-2 for the LHC Computing Grid with quite
> a number of disks for nearly 10 years now, I'd have never stumbled on
> such a case of breakage so far...
>
> Actually most cases are as simple as HDD fails to work and this is
> properly signalled to the controller.

I administer no real storage at this time, and have only 16 disks (plus a
few disk-likes) to my name right now.  Yet in a span of ~2 months I've seen
three cases of silent data corruption:

* a RasPi I used for DNS recursor/DHCP/aiccu started mangling some writes,
  with no notification that something was amiss.  With ext4 being a
  silentdatalossfs, there was no clue it was a disk (ok, SD) problem at all,
  making it really "fun" to debug.  It happens on multiple SD cards, thus
  it's the machine that's at fault.

* a HDD had some link resets and silent data corruption, traced to a bad
  SATA cable; the disk has worked fine since (obviously after extensive
  tests).

* a HDD that has link resets and silent data corruption (apparently
  write-time only(?)), Marduk knows why.  It happens with multiple cables
  and two machines, putting the blame somewhere on the disk.

Thus, the assumption that the controller will be notified about read errors
is quite invalid.  In the above cases, if recovery were possible it'd be
beneficial to rewrite a good copy of the data.

Meow!
-- 
The bill declaring Jesus as the King of Poland fails to specify whether
the addition is at the top or end of the list of kings.  What should the
historians do?
