On 2016-05-16 23:42, Chris Murphy wrote:
> On Mon, May 16, 2016 at 5:44 PM, Richard A. Lochner <lochner@xxxxxxxxxx> wrote:
>> Chris,
>>
>> It has actually happened to me three times that I know of in ~7 months,
>> but your point about the "larger footprint" for data corruption is a
>> good one. No doubt I have silently experienced that too.
> I dunno, three is a lot to have the exact same corruption happen only
> in memory and then get written out into two copies with valid node
> checksums, and yet not have other problems with a node item, or uuid,
> or xattr, or any number of other item or object types, all of which
> get checksummed. I suppose if the file system contains large files,
> the percentage of metadata that's csums could be the 2nd largest
> footprint. But still.

Assuming that the workload on the volume is mostly backup images like
the file that originally sparked this discussion, then inodes, xattrs,
and even UUIDs would be nowhere near as common as metadata blocks just
containing checksums. The fact that this hasn't hit any metadata
checksums is unusual, but not impossible.
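
Rough numbers to back that up: with the btrfs defaults of a 4 KiB
sector and a 4-byte CRC-32C per sector, a single large backup image
generates megabytes of csum items against a couple hundred bytes of
inode and xattr items. A quick sketch (the 100 GiB image size is just
an assumed example):

SECTOR = 4 * 1024     # bytes of file data covered by one checksum (btrfs default)
CSUM_SIZE = 4         # CRC-32C is 4 bytes per sector

def csum_bytes(data_bytes):
    """Approximate bytes of csum items needed to cover data_bytes of file data."""
    sectors = -(-data_bytes // SECTOR)   # ceiling division
    return sectors * CSUM_SIZE

image = 100 * 1024**3                    # one hypothetical 100 GiB backup image
print(csum_bytes(image) // 1024**2, "MiB of csums")   # -> 100 MiB

So one such file needs roughly 100 MiB of checksums but only a handful
of other metadata items, which is why a flipped bit in metadata is so
likely to land in a csum block on this kind of volume.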

> Three times in 7 months, if it's really the same vector, is just short
> of almost reproducible. Ha. It seems like if you merely balanced this
> file system a few times, you'd eventually stumble on this. And if
> that's true, then it's time to turn on debug options, see if it can be
> caught in action, and find out whether there's a hardware or software
> explanation for it.
>
>> And, as you suggest, there is no way to prevent those errors. If the
>> memory to be written to disk gets corrupted before its checksum is
>> calculated, the data will be silently corrupted, period.
> Well, no way in the present design, maybe.

If the RAM is bad, there is no way we can completely protect user data,
period. We can try to mitigate certain situations, but we cannot
protect against all forms of memory corruption.
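
To make the window concrete, here's a minimal sketch (plain Python, not
btrfs code) of the failure mode Richard describes, using zlib's CRC-32
as a stand-in for btrfs's CRC-32C:

import zlib

data = bytearray(b"file data destined for disk")

# A memory error flips a bit BEFORE the checksum is computed...
data[3] ^= 0x04

# ...so the checksum is calculated over already-corrupted data:
csum = zlib.crc32(data)

# On read-back the verification passes, and the corruption is silent:
assert zlib.crc32(data) == csum

No amount of checksumming at the filesystem level can catch this; the
checksum faithfully covers the wrong bytes.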

>> Clearly, I won't rely on this machine to produce any data directly
>> that I would consider important at this point.
>>
>> One odd thing to me is that if this is really due to undetected memory
>> errors, I'd think this system would crash fairly often due to detected
>> "parity errors." This system rarely crashes. It often runs for
>> several months without any indication of problems.
> I think you'd have other problems. Only data csums are being corrupted
> after they're read in, but before the node csum is computed? Three
> times? Pretty wonky.

Running regularly for several months without ECC RAM may be part of the
issue. Random bit flips from electrical noise and background radiation
accumulate over time, and beyond a certain point (which depends on more
factors than are practical to compute), you're almost certain to have
hit at least a single-bit error.
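
As a toy model of why longer uptime means more accumulated flips: if
each bit independently flips with probability p per hour, the chance of
at least one flip across n bits over t hours is 1 - (1 - p)^(n*t),
which climbs steadily toward certainty. The rate below is purely an
assumption for illustration; published DRAM soft-error rates vary by
orders of magnitude between studies:

import math

BITS = 16 * 8 * 1024**3              # assume 16 GiB of RAM, in bits

def p_any_flip(p_per_bit_hour, hours):
    # exp/log1p form avoids underflow in (1 - p)**(BITS * hours)
    return 1.0 - math.exp(BITS * hours * math.log1p(-p_per_bit_hour))

for months in (1, 3, 7):
    print(months, "months:", round(p_any_flip(1e-15, months * 30 * 24), 2))
# -> 1 months: 0.09 / 3 months: 0.26 / 7 months: 0.5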

On that note, I'd actually be curious to see how far off the checksum
is (how many bits aren't correct). Given that there are no other
visible issues with the system, I'd expect only one or at most two bits
to be incorrect.
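
If you can get the stored csum and the one recomputed from the data
side by side, the comparison is just a popcount of their XOR. A sketch
(zlib's CRC-32 again standing in for CRC-32C; the values are made up):

import zlib

def csum_bit_distance(stored_csum: int, data: bytes) -> int:
    """Hamming distance between the stored checksum and one recomputed from data."""
    return bin(stored_csum ^ zlib.crc32(data)).count("1")

data = b"block contents as read back"
stored = zlib.crc32(data) ^ 0x10     # pretend one bit flipped in the stored csum
print(csum_bit_distance(stored, data))   # -> 1

One caveat: a small Hamming distance points at the stored checksum
bytes themselves having been hit. If the data had been the thing that
flipped, the recomputed CRC would typically differ in many bit
positions, since a single-bit input change scrambles the CRC.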