----- Original Message -----
> From: "Qu Wenruo" <quwenruo.btrfs@xxxxxxx>
> To: "STEVE LEUNG" <sjleung@xxxxxxx>, linux-btrfs@xxxxxxxxxxxxxxx
> Sent: Sunday, February 10, 2019 6:52:23 AM
> Subject: Re: corruption with multi-device btrfs + single bcache, won't mount
>
> On 2019/2/10 2:56 PM, STEVE LEUNG wrote:
>> Hi all,
>>
>> I decided to try something a bit crazy, and try multi-device raid1 btrfs on
>> top of dm-crypt and bcache. That is:
>>
>> btrfs -> dm-crypt -> bcache -> physical disks
>>
>> I have a single cache device in front of 4 disks. Maybe this wasn't
>> that good of an idea, because the filesystem went read-only a few
>> days after setting it up, and now it won't mount. I'd been running
>> btrfs on top of 4 dm-crypt-ed disks for some time without any
>> problems, and only added bcache (taking one device out at a time,
>> converting it over, adding it back) recently.
>>
>> This was on Arch Linux x86-64, kernel 4.20.1.
>>
>> dmesg from a mount attempt (using -o usebackuproot,nospace_cache,clear_cache):
>>
>> [ 267.355024] BTRFS info (device dm-5): trying to use backup root at mount time
>> [ 267.355027] BTRFS info (device dm-5): force clearing of disk cache
>> [ 267.355030] BTRFS info (device dm-5): disabling disk space caching
>> [ 267.355032] BTRFS info (device dm-5): has skinny extents
>> [ 271.446808] BTRFS error (device dm-5): parent transid verify failed on
>> 13069706166272 wanted 4196588 found 4196585
>> [ 271.447485] BTRFS error (device dm-5): parent transid verify failed on
>> 13069706166272 wanted 4196588 found 4196585
>
> When this happens, there is no good way to completely recover (btrfs
> check pass after the recovery) the fs.
>
> We should enhance btrfs-progs to handle it, but it will take some time.
>
>> [ 271.447491] BTRFS error (device dm-5): failed to read block groups: -5
>> [ 271.455868] BTRFS error (device dm-5): open_ctree failed
>>
>> btrfs check:
>>
>> parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
>> parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
>> parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
>> parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
>> Ignoring transid failure
>> ERROR: child eb corrupted: parent bytenr=13069708722176 item=7 parent level=2
>> child level=0
>> ERROR: cannot open file system
>>
>> Any simple fix for the filesystem? It'd be nice to recover the data
>> that's hopefully still intact. I have some backups that I can dust
>> off if it really comes down to it, but it's more convenient to
>> recover the data in-place.
>
> However there is a patch to address this kinda "common" corruption scenario.
>
> https://lwn.net/Articles/777265/
>
> In that patchset, there is a new rescue=bg_skip mount option (needs to
> be used with ro), which should allow you to access whatever you still
> have from the fs.
>
> From other reporters, such corruption is mainly related to extent tree,
> thus data damage should be pretty small.

Ok I think I spoke too soon. Some files are recoverable, but many
cannot be read. Userspace gets back an I/O error, and the kernel log
reports similar parent transid verify failed errors, with what seem to
be similar generation numbers to what I saw in my original mount
error.

i.e. wants 4196588, found something that's off by usually 2 or 3.
Occasionally there's one that's off by about 1300.

There are multiple snapshots on this filesystem (going back a few
days), and the same file in each snapshot seems to be equally
affected, even if the file hasn't changed in many months.

Metadata seems to be intact - I can stat every file in one of the
snapshots and I don't get any errors back.

Any other ideas?
It kind of seems like "btrfs restore" would be suitable here, but it
sounds like it would need to be taught about rescue=bg_skip first.

Thanks for all the help. Even a partial recovery is a lot better than
what I was facing before.

Steve
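
To put numbers on how far the generations are off across the unreadable
files, the kernel messages can be tallied. What follows is only a minimal
Python sketch, assuming the log text uses exactly the "parent transid
verify failed on ... wanted ... found ..." wording quoted in this thread.
The function name transid_gaps is hypothetical and not part of btrfs-progs
or the kernel.

```python
import re
from collections import Counter

# Pattern for the message format quoted in this thread. \s+ is used between
# fields so a message that was wrapped onto a second line still matches.
TRANSID_RE = re.compile(
    r"parent transid verify failed on\s+(\d+)\s+wanted\s+(\d+)\s+found\s+(\d+)"
)


def transid_gaps(log_text):
    """Count distinct failing blocks per generation gap (wanted - found).

    The kernel often repeats the same message for one block, so each
    (bytenr, wanted, found) triple is counted only once.
    """
    seen = set()
    for bytenr, wanted, found in TRANSID_RE.findall(log_text):
        seen.add((int(bytenr), int(wanted), int(found)))
    return Counter(wanted - found for _, wanted, found in seen)


# The two identical messages from the mount attempt quoted above.
SAMPLE = (
    "[ 271.446808] BTRFS error (device dm-5): parent transid verify failed on "
    "13069706166272 wanted 4196588 found 4196585\n"
    "[ 271.447485] BTRFS error (device dm-5): parent transid verify failed on "
    "13069706166272 wanted 4196588 found 4196585\n"
)

print(dict(transid_gaps(SAMPLE)))  # {3: 1}
```

Feeding it the full kernel log from a read attempt (for example the saved
output of dmesg) would show whether the gaps cluster around 2 or 3, with
the occasional outlier near 1300, as described above.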
