Re: many csum warning/errors on qemu guests using btrfs

On 4/30/20 3:46 AM, Qu Wenruo wrote:


On 2020/4/30 3:21 AM, Chris Murphy wrote:
On Wed, Apr 29, 2020 at 9:45 AM Michal Soltys <msoltyspl@xxxxxxxxx> wrote:

Short update:

1) turned out not to be btrfs's fault in any way or form, as we recreated
the same issue with ext4 while manually checksumming the files; so if
anything, btrfs told us we have actual issues somewhere =)

Is that related to mixing buffered writes with DIO writes?

If so, maybe changing the qemu cache mode would help?

Thanks,
Qu


Well, we initially thought the issue was with VMs only - but we also managed to hit the problem on the host machine directly. As for the VMs - they are all on separate LVM volumes (raw, not images on a filesystem - if that's what you meant in the context of mixing write modes).
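For reference, the cache mode Qu refers to is qemu's cache= drive property - the line below is purely illustrative, with a placeholder LV path rather than our actual invocation:

# cache=none opens the backing device with O_DIRECT, so the host page cache
# is bypassed and buffered/DIO writes aren't mixed on the host side
qemu-system-x86_64 -m 4096 \
    -drive file=/dev/vg0/vm-disk,format=raw,cache=none,if=virtio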

The stack without qemu looks like this (a rough sketch of how it's assembled follows the list):

- at the bottom, a 24-disk backplane connected to an LSI 2308 controller (v20 firmware - for the record, I found some tidbits that this particular firmware version proved problematic for some people)
- md raid5 over 4 mechanical disks, using a write-back journal (the journal device is an md raid1 of 2 SSDs in the same backplane)
- the above raid device is added to an LVM VG as a PV
- this PV is used for the thin pool's data; 2 other SSDs (mirrored at the LVM level, physically not in the backplane) are used for the thin pool's metadata
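Roughly, the assembly looks like this (a sketch only - device names, sizes and the VG name are placeholders, not the exact ones used):

# journal device: md raid1 of the two SSDs sitting in the backplane
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdy /dev/sdz

# data array: md raid5 over the 4 mechanical disks, journaled to the raid1 above
mdadm --create /dev/md0 --level=5 --raid-devices=4 --write-journal=/dev/md1 /dev/sd[a-d]
echo write-back > /sys/block/md0/md/journal_mode   # write-back (not write-through) journal

# one VG holding the raid5 array plus the two external (non-backplane) SSDs
pvcreate /dev/md0 /dev/sdu /dev/sdv
vgcreate vg0 /dev/md0 /dev/sdu /dev/sdv

# thin pool: data on the raid5 PV, metadata mirrored across the two SSD PVs
lvcreate -n pooldata -L 10T vg0 /dev/md0
lvcreate -n poolmeta --type raid1 -m1 -L 16G vg0 /dev/sdu /dev/sdv
lvconvert --type thin-pool --poolmetadata vg0/poolmeta vg0/pooldata

# a test LV in the pool, then mkfs.ext4 / mkfs.btrfs on it
lvcreate -V 100G --thinpool vg0/pooldata -n test vg0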

Then it's a matter of a simple mkfs.ext4 or mkfs.btrfs on an LV created in the above pool, and a 16 GB file created with e.g.:

dcfldd textpattern=$(hexdump -v -n 8192 -e '1/4 "%08X"' /dev/urandom) hash=md5 hashlog=./test.md5 bs=262144 count=$((16*4096)) of=test.bin totalhashformat="#hash#"

This will usually (though not always) produce an image that reads back (after dropping caches) with a different checksum (ext4 case), or that btrfs scrub complains about (btrfs case).
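Concretely, "reads back with a different checksum" means something along these lines (a sketch; the filenames are the ones from the command above):

sync
echo 3 > /proc/sys/vm/drop_caches
md5sum test.bin    # compare against the total md5 dcfldd recorded in test.md5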

The culprits in the file are one to a few 4 KB pieces with junk in them. That size also doesn't match any of the other sizes used across the stack (md raid: default 512 KB chunk, i.e. a 1.5 MB stripe; LVM extents: 120 MB; thin-pool chunks: 1.5 MB).
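To give an idea of how the bad 4 KB pieces can be located - a rough sketch, assuming a known-good reference copy written with the same (saved) pattern; reference.bin is hypothetical:

# count differing bytes per 4 KiB block (cmp -l prints 1-based byte offsets)
cmp -l reference.bin test.bin | awk '{ print int(($1 - 1) / 4096) }' | uniq -c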

While trying to replicate the issue, here's what didn't work:

- I put 4 other disks in the backplane and created another raid5 the same way - using the same SSDs as above for its journal - no issues
- used the new md raid as an LVM linear volume - no issues either
- used the new md raid for an LVM thin pool (using the same SSDs as earlier) - no issues either
- used the old (!?!) md raid (the one giving issues) but created a linear volume on it - no issues

By "no issues" I mean the above dcfldd command running in a loop for 3-6 hours, interleaved with sync/fstrim/drop_caches as appropriate.
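The loop itself is nothing fancy - roughly (a sketch; run_once.sh is a hypothetical wrapper around the dcfldd write plus the checksum verification shown earlier, and /mnt/test stands in for the mountpoint):

while true; do
    ./run_once.sh /mnt/test    # write test.bin, re-check its md5 after dropping caches
    sync
    fstrim -v /mnt/test
    echo 3 > /proc/sys/vm/drop_caches
done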

By "issues" I mean that 1-3 runs are enough to create a file with silent corruption.

What's worse, the "working" and "non-working" cases weirdly overlap with each other, making it hard to reasonably pinpoint the cause (as in, "it always happens if I use X").

While I realize this turned out not to be exactly btrfs mailing list material - I'd appreciate any suggestions. For now I'm planning to downgrade the controller firmware from v20 to v19 and update the kernel - and see if that helps.
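(As a side note on checking which firmware the controller currently runs - a sketch only, assuming LSI's sas2flash utility is available:)

sas2flash -listall    # lists SAS2 controllers with their firmware/BIOS versions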



2) the qemu/VM scenario is also not to blame, as we recreated the issue
directly on the host as well

So as far as I can see, both of the above narrow the potential culprits
down to either faulty/buggy hardware/firmware somewhere, or some subtle
lvm/md/kernel issue. Though so far, pinpointing it is proving
rather frustrating.


Anyway, sorry for the noise.

It's not noise. I think it's useful to see how Btrfs can help isolate
such cases.





