I have a pretty complicated setup here, so first a general description:

8 HDs: 4x5TB, 2x4TB, 2x8TB

Each disk is an LVM PV containing a BCACHE backing device, which in turn contains the BTRFS filesystem. All the drives were in writeback mode on an SSD BCACHE cache partition (terrible setup, I know, but without the caching the system was getting too slow to use).

I had all my data, metadata and system blocks on RAID1, but as I'm running out of space, and the newer kernels have been getting better RAID5/6 support, I finally decided to migrate to RAID6 and started with the metadata.

It was running well (I was already expecting it to be slow, so no problem there), but I had to spend some days away from the machine. Due to an air conditioning failure, the room temperature went pretty high and one of the disks decided to die (apparently only temporarily). BCACHE couldn't write to the backing device anymore, so it ejected all drives and let them cope with it by themselves. I caught the trouble some 12h later, while still away, and shut down anything accessing the disks until I could be physically there to handle the issue.

After I got back and got the temperature down to acceptable levels, I checked the failed drive, which seems to be working well after getting re-inserted, but it's of course out of date with the rest of the drives. Apparently the rest got some corruption as well when they were ejected from the cache, and I'm getting some errors I haven't been able to handle.

I've gone through the steps that helped me before when I had complicated crashes on this system, but this time it wasn't enough, and I'll need some advice from people who know the BTRFS internals better than me to get this running again. I have around 20TB of data on the drives, so copying the data out is the last resort, as I'd prefer to let most of it die than to buy a few disks to fit all of that.
Now on to the errors:

I've tried both with the "failed" drive in (which gives me additional transid errors) and without it.

Trying to mount with it gives me:

[Jan 7 20:18] BTRFS info (device bcache0): enabling auto defrag
[ +0.000010] BTRFS info (device bcache0): disk space caching is enabled
[ +0.671411] BTRFS error (device bcache0): parent transid verify failed on 77292724051968 wanted > 1499510 found 1499467
[ +0.005950] BTRFS critical (device bcache0): corrupt leaf: root=2 block=77292724051968 slot=2, bad key order, prev (39029522223104 168 212992) current (39029521915904 168 16384)
[ +0.000378] BTRFS error (device bcache0): failed to read block groups: -5
[ +0.022884] BTRFS error (device bcache0): open_ctree failed

Trying without the disk (and -o degraded) gives me:

[Jan 8 12:51] BTRFS info (device bcache1): enabling auto defrag
[ +0.000002] BTRFS info (device bcache1): allowing degraded mounts
[ +0.000002] BTRFS warning (device bcache1): 'recovery' is deprecated, use 'usebackuproot' instead
[ +0.000000] BTRFS info (device bcache1): trying to use backup root at mount time
[ +0.000002] BTRFS info (device bcache1): disabling disk space caching
[ +0.000001] BTRFS info (device bcache1): force clearing of disk cache
[ +0.001334] BTRFS warning (device bcache1): devid 2 uuid 27f87964-1b9a-466c-ac18-b47c0d2faa1c is missing
[ +1.049591] BTRFS critical (device bcache1): corrupt leaf: root=2 block=77291982323712 slot=0, unexpected item end, have 685883288 expect 3995
[ +0.000739] BTRFS error (device bcache1): failed to read block groups: -5
[ +0.017842] BTRFS error (device bcache1): open_ctree failed

btrfs check output (without drive):

warning, device 2 is missing
checksum verify failed on 77088164081664 found 715B4470 wanted 580444F6
checksum verify failed on 77088164081664 found 98775719 wanted FA63AD42
checksum verify failed on 77088164081664 found 98775719 wanted FA63AD42
bytenr mismatch, want=77088164081664, have=274663271295232
Couldn't read chunk tree
ERROR: cannot open file system

I've already tried super-recover, zero-log and chunk-recover without any results, and check with --repair fails the same way as without.

So, any ideas? I'll be happy to run experiments and grab more logs if anyone wants more details.

And thanks for any suggestions.
