Re: btrfs and ECC RAM

(apologies for messing up the threading; I thought I could get away with not subscribing.  I've subscribed now.)

> Martin Steigerwald <Martin <at> lichtvoll.de> wrote:
> On Saturday, 18 January 2014, 07:16:42, ... wrote:
> I think Ian refers to the slight chance that BTRFS assumes the checksum on one
> disk to be incorrect due to a memory error *and* on another disk to be correct
> due to another memory error *and* will silently rewrite the incorrect data to
> the correct data.
> 
> AFAIK BTRFS still does not correct such errors automatically, but only on a
> scrub. There this *could* happen theoretically.
> 
> My gut feeling is, that this is highly, highly unlikely.
> 
> At least not more likely than a controller writing out garbage or other such 
> hardware issues.

Actually, I hadn't fully understood this scenario; I was just asking because of what some of the ZFS people were saying.  

To clarify, what you describe could happen like this (is this what you meant?):

- Checksum is computed
- Checksum and data are written to locations A and B, but the data bound for location B suffers a memory corruption en route (maybe in some intermediate buffer), so it is stored incorrectly on disk
- btrfs scrub then reads A, but suffers a memory error, and thinks the good data is bad
- Hence B is read, but another memory error causes the checksum to pass
- Since the checksum passed, B is written to A, overwriting the data

This requires a collision in the checksumming algorithm, so I don't think we need to worry about this case. It's at least as unlikely as a random disk error passing the checksum by chance, and that chance is negligible.
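
To make this first scenario concrete, here is a minimal sketch of a repair step for one mirrored block. It is not btrfs code: `scrub_block` and its return values are made up for illustration, and `zlib.crc32` (plain CRC-32) stands in for the CRC32C that btrfs uses. It only shows that a corrupted copy B is never written over A unless it still matches the stored checksum.

```python
import zlib

def checksum(data: bytes) -> int:
    # Stand-in checksum: plain CRC-32 from the standard library,
    # not the CRC32C that btrfs actually uses.
    return zlib.crc32(data)

def scrub_block(copy_a: bytes, copy_b: bytes, stored_sum: int):
    """Hypothetical repair step for one mirrored block.

    Returns (new_a, action), where action is 'ok', 'repaired'
    or 'unrecoverable'.
    """
    if checksum(copy_a) == stored_sum:
        return copy_a, "ok"
    if checksum(copy_b) == stored_sum:
        # B verified, so it is written over A.
        return copy_b, "repaired"
    # Neither copy verifies: nothing is overwritten.
    return copy_a, "unrecoverable"

good = b"important data" * 100
stored = checksum(good)

# A is misread because of a memory error (one flipped bit).
bad_a = bytes([good[0] ^ 0x01]) + good[1:]
# B was corrupted on its way to disk (a different flipped bit).
bad_b = good[:10] + bytes([good[10] ^ 0x80]) + good[11:]

new_a, action = scrub_block(bad_a, bad_b, stored)
print(action)  # unrecoverable
```

A CRC always detects a single flipped bit, so the bad B is rejected here and A is left alone. For the bad B to be copied over A, the corrupted data would have to produce the same 32-bit checksum as the good data, which is the collision mentioned above.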

Another possibility is that A and B are both written correctly.  A memory error then happens when reading A from disk, which triggers a read of B, and B passes its checksum.  B is then written to A, but another memory error in some buffer corrupts the data on its way to disk.  This doesn't require a checksum collision, just frequent memory errors.  Could it lead to trashing the whole FS during a scrub if memory errors are sufficiently frequent?  If memory errors occur so frequently that you get two of them while processing a single block during a scrub, then your memory is probably very far gone.  On the other hand, if btrfs reuses the same two memory buffers for reads and writes, and the errors happen to be in those buffers, then maybe this isn't so unlikely?  This could perhaps be mitigated by cancelling the scrub if too many errors require rewrites.
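
The mitigation idea could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `scrub` function, the `max_rewrites` limit, the `ScrubAborted` exception and the block layout are invented, and `zlib.crc32` again stands in for CRC32C. It is not how btrfs scrub is implemented.

```python
import zlib

class ScrubAborted(Exception):
    """Raised when a scrub pass wants to rewrite more blocks than allowed."""

def scrub(blocks, max_rewrites):
    """Repair copy A from copy B, giving up after max_rewrites repairs.

    Each block is a dict with keys 'a', 'b' (the two copies) and
    'sum' (the stored checksum). This layout is hypothetical.
    """
    rewrites = 0
    for index, block in enumerate(blocks):
        if zlib.crc32(block["a"]) == block["sum"]:
            continue  # A verifies, nothing to do
        if zlib.crc32(block["b"]) != block["sum"]:
            continue  # neither copy verifies, leave both alone
        if rewrites >= max_rewrites:
            # Too many apparent errors: suspect the RAM, not the disks.
            raise ScrubAborted(
                f"stopped at block {index} after {rewrites} rewrites")
        block["a"] = block["b"]
        rewrites += 1
    return rewrites

def make_block(payload, a_misread=False):
    # Simulate a memory error on the read of A by changing its first byte.
    a = b"X" + payload[1:] if a_misread else payload
    return {"a": a, "b": payload, "sum": zlib.crc32(payload)}

# Every second block appears bad on read, as with a faulty read buffer.
blocks = [make_block(b"block %d" % i, a_misread=(i % 2 == 0))
          for i in range(10)]

try:
    scrub(blocks, max_rewrites=3)
    aborted = False
except ScrubAborted as exc:
    aborted = True
    print(exc)  # stopped at block 6 after 3 rewrites
```

With such a limit, a machine with badly failing memory would damage at most a handful of blocks before the scrub gives up, instead of potentially rewriting every block it touches.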

-- 
Ian Hinder
http://numrel.aei.mpg.de/people/hinder
