On 4/30/20 3:46 AM, Qu Wenruo wrote:
On 2020/4/30 3:21 AM, Chris Murphy wrote:
On Wed, Apr 29, 2020 at 9:45 AM Michal Soltys <msoltyspl@xxxxxxxxx> wrote:
Short update:
1) turned out not to be btrfs's fault in any way or form, as we recreated
the same issue with ext4 while manually checksumming the files; so if
anything, btrfs told us we have actual issues somewhere =)
Is that related to mixing buffered write with DIO write?
If so, maybe changing the qemu cache mode may help?
Thanks,
Qu
Well, we initially thought the issue was with VMs only - but we also
managed to hit the problem on the host machine directly. As for VMs -
they are all on separate lvm volumes (raw, not as images on a filesystem
- if that's what you meant in the context of mixing write modes).
The without-qemu stack looks like this:
- on the bottom, a 24-disk backplane connected to an lsi 2308 controller
(v20 firmware - for the record, I found some tidbits that this
particular firmware version proved problematic for some people)
- md raid5 - 4 mechanical disks using a write-back journal (the journal
device is md raid1 - 2 ssds in the same backplane)
- the above raid device is added to an lvm vg as a pv
- this pv is used for the thin pool's data, and 2 other ssds (mirrored
on lvm level, physically not in the backplane) are used for the thin
pool's metadata
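For reference, a rough sketch of how a stack like this could be put
together - the device names, sizes and vg/lv names below are
placeholders (not the real ones), and every flag should be checked
against the mdadm/LVM versions actually in use:

```shell
# Hypothetical reconstruction of the stack described above.
# All /dev/sd* names, sizes and vg/lv names are placeholders.

# ssd mirror that will hold the raid5 write-back journal
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdx1 /dev/sdy1

# 4-disk raid5 with the journal device attached, switched to write-back
mdadm --create /dev/md0 --level=5 --raid-devices=4 \
      --write-journal /dev/md1 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
echo write-back > /sys/block/md0/md/journal_mode

# vg with the raid as the data pv plus 2 separate ssds for metadata
# (-s 120m matches the extent size mentioned below)
vgcreate -s 120m vg0 /dev/md0 /dev/sde /dev/sdf
lvcreate --type raid1 -m1 -L 2G -n pool_meta vg0 /dev/sde /dev/sdf
lvcreate -l 90%FREE -n pool_data vg0 /dev/md0
lvconvert --type thin-pool --poolmetadata vg0/pool_meta vg0/pool_data

# thin lv to run mkfs.ext4 / mkfs.btrfs on
lvcreate -V 100G --thinpool pool_data -n testlv vg0
```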
Then it's a matter of a simple mkfs.ext4 or mkfs.btrfs on an lv created
in the above pool, and creating a 16gb file with e.g.:

dcfldd textpattern=$(hexdump -v -n 8192 -e '1/4 "%08X"' /dev/urandom) \
       hash=md5 hashlog=./test.md5 bs=262144 count=$((16*4096)) \
       of=test.bin totalhashformat="#hash#"
This will usually (though not always) produce an image that reads back
(after dropping caches) with a different checksum (ext4 case), or that
btrfs scrub complains about (btrfs case).
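The read-back check itself can be as small as the sketch below -
assuming the dcfldd invocation wrote the image to test.bin and the
total md5 (just the hash, per totalhashformat) to test.md5:

```shell
# Minimal read-back check; assumes the image is in test.bin and the
# dcfldd hash log (containing only the md5) in test.md5.
# Prints "OK" or a mismatch message on stdout.
verify_readback() {
    img=$1 log=$2
    sync
    # drop the page cache so the data really comes from disk;
    # needs root, silently skipped otherwise
    sh -c 'echo 3 > /proc/sys/vm/drop_caches' 2>/dev/null || true
    written=$(cat "$log")
    readback=$(md5sum "$img" | cut -d' ' -f1)
    if [ "$written" = "$readback" ]; then
        echo OK
    else
        echo "MISMATCH: wrote $written, read back $readback"
    fi
}

verify_readback test.bin test.md5
```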
The culprits in the file are one to a few 4kb pieces with junk in them.
That size also doesn't match any of the other sizes used across the
stack (md raid: default 512kb chunk, 1.5m stripe; lvm extents: 120m;
thin-pool chunks: 1.5m).
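Since the junk shows up in 4kb pieces, one quick sanity check is to
take the byte offset of each corrupt block (from btrfs scrub or a
byte-wise compare) and look at its alignment against each layer's
granularity - a small sketch, using the sizes from this stack in bytes:

```shell
# Print a corrupt block's byte offset modulo each layer's granularity.
# A result of 0 means the corruption sits exactly on that layer's
# boundary, which would point a finger at that layer. Sizes are the
# ones from the stack above: 512kb md chunk, 1.5m stripe and thin-pool
# chunk, 120m lvm extent.
align_report() {
    off=$1
    for layer in "md-chunk:524288" "md-stripe:1572864" \
                 "thin-chunk:1572864" "lvm-extent:125829120"; do
        name=${layer%%:*}
        size=${layer#*:}
        printf '%-11s %d %% %d = %d\n' "$name" "$off" "$size" $((off % size))
    done
}

align_report 1572864
```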
While trying to get the issue replicated, what didn't work:
- I put 4 other disks in the backplane and created another raid5 in the
same way - using the same ssds as above for its journal - no issues
- used the new md raid as lvm linear volume - no issues either
- used the new md raid for lvm thin pool (using same ssds as earlier) -
no issues either
- used the old (!?!) md raid (the one giving issues) but creating a
linear volume on it - no issues
By "no issues" I mean the above dcfldd running in a loop for 3-6 hours,
interleaved with sync/fstrim/drop_caches as appropriate.
By "issues" I mean 1-3 runs being enough to create a file with silent
corruptions.
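Put together, the reproduction loop looks roughly like the sketch
below - the mountpoint and file names are placeholders, and it needs
root (for fstrim and drop_caches) plus dcfldd installed:

```shell
# Sketch of the reproduction loop: rewrite the test file, interleave
# sync/fstrim/cache drops, stop on the first silent corruption.
# /mnt/testlv and the file names are placeholders.
cd /mnt/testlv || exit 1
run=0
while :; do
    run=$((run + 1))
    dcfldd textpattern="$(hexdump -v -n 8192 -e '1/4 "%08X"' /dev/urandom)" \
           hash=md5 hashlog=./test.md5 bs=262144 count=$((16*4096)) \
           of=test.bin totalhashformat="#hash#"
    sync
    fstrim /mnt/testlv
    echo 3 > /proc/sys/vm/drop_caches
    if [ "$(md5sum test.bin | cut -d' ' -f1)" = "$(cat test.md5)" ]; then
        echo "run $run: clean"
    else
        echo "run $run: silent corruption detected"
        break
    fi
done
```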
What's worse, the "working" and "non-working" cases weirdly overlap
with each other, making it hard to reasonably pinpoint the cause (there
is no clean rule like "it always happens if I use X").
While I realize it turned out not to be exactly btrfs mailing list
material - I'd appreciate any suggestions. For now I'm planning to
downgrade the controller firmware from v20 to v19 and update the kernel
- and see if that happens to help.
2) the qemu/vm scenario is also not to blame, as we recreated the issue
directly on the host as well
So as far as I can see, both of the above narrow the potential culprits
down to either faulty/buggy hardware/firmware somewhere - or - some
subtle lvm/md/kernel issue. So far, though, pinpointing it is proving
rather frustrating.
Anyway, sorry for the noise.
It's not noise. I think it's useful to see how Btrfs can help isolate
such cases.