On Tue, Sep 8, 2009 at 5:53 PM, Tracy Reed<treed@xxxxxxxxxxxxxxx> wrote:
> On Tue, Sep 08, 2009 at 10:22:11PM +0200, Markus Trippelsdorf spake thusly:
>> I've already deleted the file in question unfortunately.
>> On IRC Chris decided that either bad RAM or a harddrive error was the
>> most likely reason for this checksum mismatch.
>
> Which raises an interesting point: I know reiserfs had its problems
> but it also turned up a lot of machines with bad RAM which contributed
> to giving the fs a bad name. With more and more complicated and memory
> consuming filesystem datastructures being stored in RAM, larger volumes
> of RAM in systems, and RAM not really getting any more reliable will
> we ever see a day where something like btrfs is not recommended for
> use in any machine that doesn't have ECC? Does the filesystem do
> anything to protect itself from bad hardware?

Such as the checksums that started this thread? That *is* a feature
that protects against bad hardware.

A large part of reiserfs' problem was a religious degree of "panic on
inconsistency!", so failures of identical severity that might slip by
unnoticed on other file systems were more likely to be noticed.

Sadly, shooting the messenger is still a popular sport, and the
qualities of BTRFS which make it more resistant to bad hardware may
well give it a bad reputation. I don't know that there is much that
can be done about that.

On Wed, Sep 9, 2009 at 3:01 AM, Jens Axboe<jens.axboe@xxxxxxxxxx> wrote:
> On Wed, Sep 09 2009, Markus Trippelsdorf wrote:
>> What a strange coincidence that it affected git pack files in both cases.
>> It's almost too improbable...
>
> Probably more than a coincidence I think, the question is what though...

Could this have been the same data in both cases?

Either way, if the hardware was randomly corrupting high entropy
blocks with very low probability, it's quite possible that you two
would have seen it while anyone else who did chalked it up to some
other problem.
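
To make the checksum point concrete, here is a rough sketch of the
idea, not btrfs code. It assumes a 4096-byte block and uses Python's
zlib.crc32 as a stand-in (btrfs uses crc32c, a different polynomial);
the function names are made up for illustration. A checksum is stored
when the block is written and compared again when it is read back, so
a bit flipped by bad RAM or a bad drive in between shows up as a
mismatch instead of silently bad data.

```python
import os
import zlib
from typing import Tuple

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def write_block(data: bytes) -> Tuple[bytes, int]:
    """Return the block plus the checksum computed at write time."""
    return data, zlib.crc32(data)


def read_block(data: bytes, stored_csum: int) -> bytes:
    """Return the data if it still matches the stored checksum, else raise."""
    if zlib.crc32(data) != stored_csum:
        raise IOError("checksum mismatch: block changed after it was written")
    return data


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate bad RAM or a bad drive by flipping a single bit."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def detects_corruption(data: bytes, bit_index: int) -> bool:
    """True if a one-bit flip in the block is caught on read."""
    block, csum = write_block(data)
    try:
        read_block(flip_bit(block, bit_index), csum)
    except IOError:
        return True
    return False


if __name__ == "__main__":
    # A high entropy block, like the contents of a git pack file.
    block = os.urandom(BLOCK_SIZE)
    print("corruption detected:", detects_corruption(block, 12345))
```

Note that this only tells you the block is bad. It says nothing about
whether the RAM, the drive, or something in between is at fault.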

I've encountered telecom equipment where particular packet data
interacted poorly with the clock recovery hardware. "Any file
transfers fine, except for this one. This one stalls and never
finishes, but if I unzip it, it's fine!". Ugh.

Or it could be some busted ECC that always 'corrects' a particular
class of perfectly valid blocks to something wrong... or it could be a
million other things.

At the end of the day you just need to accept that the hardware is
junk. Blacklist it, give the vendor the best black eye that you can,
and move on.

I can only expect that this is going to get worse over time. I really
wish that it had become the norm for drive makers to expose an
optional raw interface to the flash. Alas, we're stuck with the
equivalent of running Linux on a hypervisor provided by Microsoft...
except the SSD makers are less experienced.
