On Fri, 11 Jul 2014 10:38:22 Duncan wrote:
> > I've moved all drives and move those to my main rig which got a nice
> > 16GB of ecc ram, so errors of ram, cpu, controller should be kept
> > theoretically eliminated.
>
> It's worth noting that ECC RAM doesn't necessarily help when it's an
> in-transit bus error. Some years ago I had one of the original 3-digit
> Opteron machines, which of course required registered and thus ECC RAM.
> The first RAM I purchased for that board was apparently borderline on its
> timing certifications, and while it worked fine when the system wasn't
> too stressed, including with memtest, which passed with flying colors,
> under medium memory activity it would very occasionally give me, for
> instance, a bad bzip2 csum, and with intensive memory activity, the
> problem would be worse (more bz2 decompress errors, gcc would error out
> too sometimes and I'd have to restart my build, very occasionally the
> system would crash).

If bad RAM causes corrupt memory but no ECC error reports then it probably
wouldn't be a bus error. A bus error SHOULD give ECC reports.

One problem is that RAM errors aren't random. From memory the Hamming codes
used fix 100% of single bit errors, detect 100% of 2 bit errors, and let some
3 bit errors through. If you have a memory module with 3 chips on it (the
later generation of DIMM for any given size) then an error in 1 chip can
change 4 bits.

The other main problem is that if you have a read or write going to the wrong
address then you lose as AFAIK there's no ECC on address lines.

But I still recommend ECC RAM, it just decreases the scope for problems. About
half the serious problems I've had with BTRFS have been caused by a faulty
DIMM...
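To make the single/double/triple bit behaviour above concrete, here is a rough sketch of a textbook extended Hamming (SECDED) code in Python. It is only an illustration: the bit layout, the 64-bit word size and the function names (layout, encode, decode, classify, tally) are my own assumptions, and real memory controllers may well use a different code, so don't read it as what any particular chipset does.

```python
from collections import Counter
from itertools import combinations

DATA_BITS = 64  # assumption: 64 data bits per ECC word, as on a 72-bit wide DIMM


def layout(data_bits):
    """Return (n, data_positions) for a Hamming code using positions 1..n.

    Positions that are powers of two hold check bits, the rest hold data.
    Position 0 is reserved for the overall parity bit.
    """
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    n = data_bits + r
    data_positions = [p for p in range(1, n + 1) if p & (p - 1)]
    return n, data_positions


def syndrome(word, n):
    """XOR of the positions (1..n) of all set bits; zero for a valid word."""
    s = 0
    for p in range(1, n + 1):
        if (word >> p) & 1:
            s ^= p
    return s


def encode(data, data_bits=DATA_BITS):
    n, data_positions = layout(data_bits)
    word = 0
    for i, p in enumerate(data_positions):
        if (data >> i) & 1:
            word |= 1 << p
    # Set the check bits (positions 1, 2, 4, ...) so the syndrome becomes zero.
    s = syndrome(word, n)
    r = 0
    while (1 << r) <= n:
        if s & (1 << r):
            word |= 1 << (1 << r)
        r += 1
    # Overall parity bit at position 0 makes the total number of set bits even.
    if bin(word).count("1") & 1:
        word |= 1
    return word


def decode(word, data_bits=DATA_BITS):
    """Return (status, data); data is None when the word is uncorrectable."""
    n, data_positions = layout(data_bits)
    s = syndrome(word, n)
    odd = bin(word).count("1") & 1
    if s == 0 and not odd:
        status = "ok"
    elif odd and s <= n:
        # Looks like a single-bit error: flip the bit the syndrome points at
        # (syndrome 0 means the overall parity bit itself).
        word ^= 1 << s
        status = "corrected"
    else:
        # Even parity with a non-zero syndrome (double error), or a syndrome
        # pointing outside the word.
        return "uncorrectable", None
    data = 0
    for i, p in enumerate(data_positions):
        if (word >> p) & 1:
            data |= 1 << i
    return status, data


def classify(data, error_positions, data_bits=DATA_BITS):
    """Flip the given bit positions in an encoded word and see what happens."""
    word = encode(data, data_bits)
    for p in error_positions:
        word ^= 1 << p
    status, out = decode(word, data_bits)
    if status == "uncorrectable":
        return "detected"
    return "recovered" if out == data else "silent corruption"


def tally(num_errors, data=0x0123456789ABCDEF, data_bits=DATA_BITS):
    """Try every combination of num_errors flipped bits and count the outcomes."""
    n, _ = layout(data_bits)
    data &= (1 << data_bits) - 1
    return Counter(
        classify(data, combo, data_bits)
        for combo in combinations(range(n + 1), num_errors)
    )
```

With this sketch tally(1) comes back all "recovered" and tally(2) all "detected", while tally(3) is a mix of "detected" and "silent corruption" (the decoder thinks it saw a single bit error and "fixes" a fourth bit). tally(3) on 64 data bits walks about 60,000 combinations so it takes a little while, pass data_bits=16 for a quick run.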
-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/
