On 2019/4/6 3:32 AM, Hugo Mills wrote:
> On Fri, Apr 05, 2019 at 10:11:57PM +0300, Nazar Mokrynskyi wrote:
>> NOTE: I do not need help with recovery, I have fully automated snapshots, backups and restoration mechanisms, the only purpose of this email is to help developers find the reason of yet another filesystem corruption and hopefully fix it.
>
> That's good news, at least.
>
>> Yet another corruption of my root BTRFS filesystem happened today.
>> Didn't bother to run scrub, balance or check, just created disk image for future investigation and restored everything from backup.
>>
>> Here is what corruption looks like:
>> [ 274.241339] BTRFS info (device dm-0): disk space caching is enabled
>> [ 274.241344] BTRFS info (device dm-0): has skinny extents
>> [ 274.283238] BTRFS info (device dm-0): enabling ssd optimizations
>> [ 310.436672] BTRFS critical (device dm-0): corrupt leaf: root=268 block=42044719104 slot=123, bad key order, prev (1240717 108 41447424) current (1240717 76 41451520)
>
> "Bad key order" is usually an indicator of faulty RAM -- a piece of
> metadata gets loaded into RAM for modification, a bit gets flipped in
> it (because the bit is stuck on one value), and then the csum is
> computed for the page (including the faulty bit), and written out to
> disk. In this case, it's not obvious, but I'd suggest that the second
> field of the key has been flipped, as 108 is 0x6c, and 76 is 0x4c --
> one bit away from each other.

Furthermore, 108 is EXTENT_DATA_KEY, a completely valid type, while there is no key type assigned to 76.

> I recommend you check your hardware thoroughly before attempting to
> rebuild the FS.

Hugo's completely right. This is very much a symptom of a memory bit flip.

> Hugo.
>
>> [ 310.449304] BTRFS critical (device dm-0): corrupt leaf: root=268 block=42044719104 slot=123, bad key order, prev (1240717 108 41447424) current (1240717 76 41451520)
>> [ 310.449309] BTRFS: error (device dm-0) in btrfs_drop_snapshot:9250: errno=-5 IO failure
>> [ 310.449311] BTRFS info (device dm-0): forced readonly
>> [ 311.266789] BTRFS info (device dm-0): delayed_refs has NO entry
>> [ 311.277088] BTRFS error (device dm-0): cleaner transaction attach returned -30
>>
>> My system just froze when I was not looking at it and this is the state it is in now.
>> File system survived from March 8th til April 05, one of the fastest corruptions in my experience.
>>
>> Looks like this happened during sending incremental snapshot to the other BTRFS filesystem, since last snapshot on that one was not read-only as it should have been otherwise.
>>
>> I'm on Ubuntu 19.04 with Linux kernel 5.0.5 and btrfs-progs v4.20.2.
>>
>> My filesystem is on top of LUKS on NVMe SSD (SM961), I have 3 snapshots created every 15 minutes from 3 subvolumes with rotation of old snapshots (can be from tens to hundreds of snapshots at any time).
>>
>> Mount options: compress=lzo,noatime,ssd
>>
>> I have full disk image with corrupted filesystem and will create Qcow2 snapshots of it, so if you want me to run any experiments, including potentially destructive, including usage of custom patches to btrfs-progs to find out the reason of corruption, would be happy to help as much as I can.
>>
>> P.S. I'm riding latest stable and rc kernels all the time and during last 6 months I've got about as many corruptions of different BTRFS filesystems as during 3 years before that, really worrying if you ask me.

This is because btrfs is now much stricter about any possible corruption (sometimes too strict during a development cycle, but this time it is a real problem).
I'm afraid it will report more and more problems, but at least next time it won't cause a mount failure; instead, the transaction will be aborted before bad data is written to disk.

Thanks,
Qu
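
For anyone who wants to double-check the single-bit-flip reading of the "bad key order" message, here is a minimal Python sketch. The key tuples are copied from the dmesg output quoted in this thread; the helper name `bit_difference` is made up for illustration, and plain tuple comparison is used only as a stand-in for the kernel's key comparison (objectid first, then type, then offset).

```python
# Key tuples (objectid, type, offset) copied from the "bad key order" message.
prev_key = (1240717, 108, 41447424)
curr_key = (1240717, 76, 41451520)


def bit_difference(a: int, b: int) -> int:
    """Count the bits that differ between a and b (hypothetical helper)."""
    return bin(a ^ b).count("1")


# Keys sort by objectid, then type, then offset; tuple comparison mirrors that.
keys_in_order = prev_key < curr_key

# XOR of the two type fields shows which bit(s) differ.
type_xor = prev_key[1] ^ curr_key[1]
flipped_bits = bit_difference(prev_key[1], curr_key[1])

# Setting the differing bit again in the current key's type field gives 108.
repaired_key = (curr_key[0], curr_key[1] | type_xor, curr_key[2])

print(f"keys in order as logged: {keys_in_order}")
print(f"type XOR: {type_xor:#x}, differing bits: {flipped_bits}")
print(f"in order with type restored: {prev_key < repaired_key}")
```

This only shows that the logged values are consistent with one flipped bit in the type field; it does not say where the flip happened, so a proper memory test is still the way to confirm it.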
