On 2016-05-16 02:07, Chris Murphy wrote:
Current hypothesis
"I suspected, and I still suspect that the error occurred upon a
metadata update that corrupted the checksum for the file, probably due
to silent memory corruption. If the checksum was silently corrupted,
it would be simply written to both drives causing this type of error."
A metadata update alone will not change the data checksums.
But let's ignore that. If there's corrupt extent csum in a node that
itself has a valid csum, this is functionally identical to e.g.
nerfing 100 bytes of a file's extent data (both copies, identically).
The fs doesn't know the difference. All it knows is the node csum is
valid, therefore the data extent csum is valid, and that's why it
assumes the data is wrong and hence you get an I/O error. And I can
reproduce most of your results by nerfing file data.
The entire dmesg for scrub looks like this:
May 15 23:29:46 f23s.localdomain kernel: BTRFS warning (device dm-6):
checksum error at logical 5566889984 on dev /dev/dm-6, sector 8540160,
root 5, inode 258, offset 0, length 4096, links 1 (path:
openSUSE-Tumbleweed-NET-x86_64-Current.iso)
May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6):
bdev /dev/dm-6 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6):
unable to fixup (regular) error at logical 5566889984 on dev /dev/dm-6
May 15 23:29:46 f23s.localdomain kernel: BTRFS warning (device dm-6):
checksum error at logical 5566889984 on dev /dev/mapper/VG-b1, sector
8579072, root 5, inode 258, offset 0, length 4096, links 1 (path:
openSUSE-Tumbleweed-NET-x86_64-Current.iso)
May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6):
bdev /dev/mapper/VG-b1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6):
unable to fixup (regular) error at logical 5566889984 on dev
/dev/mapper/VG-b1
And the entire dmesg for running sha256sum on the file is
May 15 23:33:41 f23s.localdomain kernel: __readpage_endio_check: 22
callbacks suppressed
May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6):
csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141
May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6):
csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141
May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6):
csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141
May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6):
csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141
May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6):
csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141
And I do get an I/O error for sha256sum and no hash is computed.
But there are two important differences:
1. I have two unable to fixup messages, one for each device, at the
exact same time.
2. I altered both copies of extent data.
It's a mystery to me how your file data has not changed, but somehow
the extent csum was changed but also the node csum was recomputed
correctly. That's a bit odd.
I would think this would be perfectly possible if some other file that
had a checksum in that node changed, thus forcing the node's checksum to
be updated. Theoretical sequence of events:
1. Some file which has a checksum in node A gets written to.
2. Node A is loaded into memory to update the checksum.
3. The new checksum for the changed extent in the file gets updated in
the in-memory copy of node A.
4. Node A has its own checksum recomputed based on the new data, and
then gets saved to disk.
If something happened after step 2 but before step 4 that corrupted one
of the other checksums in the in-memory copy of node A, then the node
checksum computed in step 4 will have been computed over the corrupted
data, so the node itself would still validate.
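
To make that sequence concrete, here is a toy model in Python. It is only
a sketch of the idea, not how btrfs is actually implemented: the node is
a plain dict, zlib's CRC-32 stands in for the CRC-32C that btrfs uses,
and the file names, the read_file() helper and the single flipped bit
are all made up for illustration.

```python
import zlib

BLOCK = 4096

def csum(data: bytes) -> int:
    # Stand-in checksum: zlib's CRC-32, not the CRC-32C that btrfs uses.
    return zlib.crc32(data) & 0xFFFFFFFF

def node_bytes(csums: dict) -> bytes:
    # Serialize the data checksums so the node itself can be checksummed.
    return b"".join(name.encode() + c.to_bytes(4, "little")
                    for name, c in sorted(csums.items()))

def read_file(name: str, disk_data: dict, node: dict) -> str:
    # Read path: if the node's own checksum verifies, the data checksums
    # stored in it are trusted, and any mismatch is blamed on the data.
    if csum(node_bytes(node["csums"])) != node["node_csum"]:
        return "node csum mismatch"
    if csum(disk_data[name]) != node["csums"][name]:
        return "EIO"
    return "ok"

def simulate():
    # Two hypothetical files whose data checksums live in the same node A.
    disk_data = {"a.iso": b"A" * BLOCK, "b.txt": b"B" * BLOCK}
    csums = {n: csum(d) for n, d in disk_data.items()}
    node = {"csums": csums, "node_csum": csum(node_bytes(csums))}
    before = read_file("a.iso", disk_data, node)

    # Step 1: some other file with a checksum in node A gets written to.
    disk_data["b.txt"] = b"C" * BLOCK
    # Step 2: node A is loaded into memory.
    in_mem = dict(node["csums"])
    # Assumed fault: one bit of an unrelated checksum flips in memory.
    in_mem["a.iso"] ^= 1 << 7
    # Step 3: the checksum for the changed file is updated in memory.
    in_mem["b.txt"] = csum(disk_data["b.txt"])
    # Step 4: the node checksum is recomputed over the in-memory contents
    # (including the flipped bit) and the node is written out.
    node = {"csums": in_mem, "node_csum": csum(node_bytes(in_mem))}

    return (before,
            read_file("a.iso", disk_data, node),
            read_file("b.txt", disk_data, node))

if __name__ == "__main__":
    print(simulate())
```

In this model a.iso reads fine before the update and fails with an I/O
error afterwards, even though its data never changed and node A still
passes its own check.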