On Tue, Jan 8, 2019 at 10:05 PM Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote:
>
> On 2019/1/9 3:33 AM, Thiago Ramon wrote:
> > I have a pretty complicated setup here, so first a general description:
> > 8 HDs: 4x5TB, 2x4TB, 2x8TB
> >
> > Each disk is an LVM PV containing a BCACHE backing device, which then contains the BTRFS disks. All the drives were in writeback mode on an SSD BCACHE cache partition (terrible setup, I know, but without the caching the system was getting too slow to use).
> >
> > I had all my data, metadata and system blocks on RAID1, but as I'm running out of space, and recent kernels have been getting better RAID5/6 support, I finally decided to migrate to RAID6, starting with the metadata.
> >
> > It was running well (I was already expecting it to be slow, so no problem there), but I had to spend some days away from the machine. Due to an air conditioning failure, the room temperature went pretty high and one of the disks decided to die (apparently only temporarily). BCACHE couldn't write to the backing device anymore, so it ejected all the drives and left them to cope on their own. I caught the trouble some 12h later, still away, and shut down anything accessing the disks until I could be physically there to handle the issue.
> >
> > After I got back and got the temperature down to acceptable levels, I checked the failed drive, which seems to be working fine after being re-inserted, but it's of course out of date with the rest of the drives. Apparently the rest picked up some corruption as well when they got ejected from the cache, and I'm getting some errors I haven't been able to handle.
> >
> > I've gone through the steps here that have helped me before with complicated crashes on this system, but this time they weren't enough, and I'll need some advice from people who know the BTRFS internals better than I do to get this back running. I have around 20TB of data on the drives, so copying the data out is the last resort, as I'd rather let most of it die than buy a few more disks to fit all of it.
> >
> > Now on to the errors:
> >
> > I've tried both with the "failed" drive in (which gives me additional transid errors) and without it.
> >
> > Trying to mount with it gives me:
> > [Jan 7 20:18] BTRFS info (device bcache0): enabling auto defrag
> > [ +0.000010] BTRFS info (device bcache0): disk space caching is enabled
> > [ +0.671411] BTRFS error (device bcache0): parent transid verify failed on 77292724051968 wanted 1499510 found 1499467
> > [ +0.005950] BTRFS critical (device bcache0): corrupt leaf: root=2 block=77292724051968 slot=2, bad key order, prev (39029522223104 168 212992) current (39029521915904 168 16384)
>
> Heavily corrupted extent tree.
>
> And there is a very good experimental patch for you:
> https://patchwork.kernel.org/patch/10738583/
>
> Then go mount with the "skip_bg,ro" mount options.
>
> Please note this can only help you to salvage data (it's a kernel version of btrfs-restore).
>
> AFAIK the corruption may affect fs trees too, so be aware of corrupted data.
>
> Thanks,
> Qu

Thanks for pointing me to that patch; I've tried it and the FS mounted without issues. I've managed to get a snapshot of the folder structure and haven't noticed anything important missing. Is there some way to get a list of anything that might have been corrupted, or will I just find out as I try to access the file contents?
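The only approach I've come up with myself is brute force: read every file back once and watch the kernel log for checksum complaints. Something along these lines, where /mnt/recovery is just a placeholder for wherever I have the skip_bg,ro mount:

    # read everything once and discard the data; files with bad checksums
    # should show up as "csum failed" warnings in the kernel log
    find /mnt/recovery -type f -exec cat {} + > /dev/null
    dmesg | grep -iE 'csum failed|transid verify failed'

No idea whether that's the recommended way or if there's something smarter already built into the tools.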
Also, is there any hope of recovering the trees in place, or should I just abandon this one and start with a new volume?

It occurred to me that I might be able to run a scrub on the filesystem now that it's mounted. Is that even possible in a situation like this, and, more importantly, is it sane? (Rough commands for what I had in mind are in the P.S. below.)

And finally, thanks again for the patch.

Thiago Ramon

> > [ +0.000378] BTRFS error (device bcache0): failed to read block groups: -5
> > [ +0.022884] BTRFS error (device bcache0): open_ctree failed
> >
> > Trying without the disk (and -o degraded) gives me:
> > [Jan 8 12:51] BTRFS info (device bcache1): enabling auto defrag
> > [ +0.000002] BTRFS info (device bcache1): allowing degraded mounts
> > [ +0.000002] BTRFS warning (device bcache1): 'recovery' is deprecated, use 'usebackuproot' instead
> > [ +0.000000] BTRFS info (device bcache1): trying to use backup root at mount time
> > [ +0.000002] BTRFS info (device bcache1): disabling disk space caching
> > [ +0.000001] BTRFS info (device bcache1): force clearing of disk cache
> > [ +0.001334] BTRFS warning (device bcache1): devid 2 uuid 27f87964-1b9a-466c-ac18-b47c0d2faa1c is missing
> > [ +1.049591] BTRFS critical (device bcache1): corrupt leaf: root=2 block=77291982323712 slot=0, unexpected item end, have 685883288 expect 3995
> > [ +0.000739] BTRFS error (device bcache1): failed to read block groups: -5
> > [ +0.017842] BTRFS error (device bcache1): open_ctree failed
> >
> > btrfs check output (without the drive):
> > warning, device 2 is missing
> > checksum verify failed on 77088164081664 found 715B4470 wanted 580444F6
> > checksum verify failed on 77088164081664 found 98775719 wanted FA63AD42
> > checksum verify failed on 77088164081664 found 98775719 wanted FA63AD42
> > bytenr mismatch, want=77088164081664, have=274663271295232
> > Couldn't read chunk tree
> > ERROR: cannot open file system
> >
> > I've already tried super-recover, zero-log and chunk-recover without any results, and check with --repair fails the same way as without it.
> >
> > So, any ideas? I'll be happy to run experiments and grab more logs if anyone wants more details.
> >
> > And thanks for any suggestions.
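P.S.: Regarding the scrub question above, assuming it's even allowed on a filesystem mounted with skip_bg,ro (which I don't know), what I had in mind was a read-only scrub plus the per-device error counters, roughly like this (again, /mnt/recovery is just a stand-in for my actual mount point):

    # read-only scrub: -r only reports errors, it doesn't try to fix anything
    btrfs scrub start -r /mnt/recovery
    btrfs scrub status /mnt/recovery
    # per-device error counters accumulated so far
    btrfs device stats /mnt/recovery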

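P.P.S.: If the patched-kernel route stops cooperating, my reading of Qu's "kernel version of btrfs-restore" remark is that the userspace fallback would be plain btrfs restore run against one of the devices, something like the following (device and destination are placeholders for my setup, and I haven't actually tried this yet):

    # dry run first: list what restore thinks it can pull out, write nothing
    btrfs restore -D -v /dev/bcache0 /mnt/salvage
    # real run, ignoring errors on individual files so it keeps going
    btrfs restore -i -v /dev/bcache0 /mnt/salvage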