On 2018-01-01 08:48, Stirling Westrup wrote:
> Okay, I want to start this post with a HUGE THANK YOU THANK YOU THANK YOU to Nikolay Borisov and most especially to Qu Wenruo!
>
> Thanks to their tireless help in answering all my dumb questions I have managed to get my BTRFS working again! As I speak I have the full, non-degraded quad of drives mounted and am updating my latest backup of their contents.
>
> I had a 4-drive setup with 2x4T and 2x2T drives, and one of the 2T drives failed; with help I was able to make a 100% recovery of the lost data. I do have some observations on what I went through, though. Take this as constructive criticism, or as a point for discussing additions to the recovery tools:
>
> 1) I had a 2T drive die with exactly 3 hard-sector errors, and those 3 errors exactly coincided with the 3 super-blocks on the drive.

WTF, why did all of this corruption happen at the btrfs super blocks?! What a coincidence.

> The odds against this happening as random independent events are so long as to be mind-boggling (something like 1 in 10^26).

Yep, that's also why I was thinking the corruption was much heavier than we expected.
But if it turns out to be superblocks only, then as long as the superblocks can be recovered, you're OK to go.

> So, I'm going to guess this wasn't random chance. It's possible that something inside the drive's layers of firmware is to blame, but it seems more likely to me that there must be some BTRFS process that can, under some conditions, try to update all superblocks as quickly as possible.

Btrfs only updates its superblocks when committing a transaction, and it does so only after all devices have been flushed.
AFAIK there is nothing strange here.

> I think it must be that a drive failure during this window managed to corrupt all three superblocks.

Maybe, but at least the first (primary) superblock is written with the FUA flag.
Unless you have enabled libata FUA support (which is disabled by default) AND your drive supports native FUA (not every HDD supports it; I only have one Seagate 3.5" HDD that does), the FUA write will be converted to write & flush, which should be quite safe.

The only window I can think of is between submitting the superblock write requests and waiting for them to complete.

But anyway, btrfs superblocks are the ONLY metadata not protected by CoW, so it is possible that something goes wrong with certain timing.

> It may be better to perform an update-readback-compare on each superblock before moving on to the next, so as to avoid this particular failure in the future. I doubt this would slow things down much, as the superblocks must be cached in memory anyway.

That should be done by the block layer, where things like dm-integrity could help.

> 2) The recovery tools seem too dumb while thinking they are smarter than they are. There should be some way to tell the various tools to consider some subset of the drives in a system as worth considering.

My fault: there is in fact a -F option for dump-super to force it to recognize the bad superblock and output whatever it has.
With that we would at least have been able to see whether the superblocks were really corrupted or just had a bit flip in the magic number.
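Until the tools get smarter here, something like the rough Python sketch below (mine, not part of btrfs-progs) can be used to eyeball every superblock copy on every candidate device by hand. It only assumes the well-known mirror offsets (64KiB, 64MiB and 256GiB) and a handful of struct btrfs_super_block fields (fsid at 0x20, bytenr at 0x30, the "_BHRfS_M" magic at 0x40, generation at 0x48, and the dev_item devid at 0xc9); please double-check those offsets against ctree.h before trusting its output.

#!/usr/bin/env python3
# Quick-and-dirty inspection of btrfs superblock copies. A sketch only:
# the field offsets follow my reading of struct btrfs_super_block and should
# be double-checked against ctree.h / the on-disk format documentation.
import struct
import sys

SUPER_OFFSETS = [64 * 1024, 64 * 1024 * 1024, 256 * 1024 * 1024 * 1024]
SUPER_SIZE = 4096
BTRFS_MAGIC = b"_BHRfS_M"          # at offset 0x40 inside the superblock

def dump_supers(path: str) -> None:
    with open(path, "rb") as f:
        for off in SUPER_OFFSETS:
            f.seek(off)
            sb = f.read(SUPER_SIZE)
            if len(sb) < SUPER_SIZE:
                print("%s @ 0x%x: device too small for this copy" % (path, off))
                continue
            fsid = sb[0x20:0x30].hex()
            bytenr, = struct.unpack_from("<Q", sb, 0x30)
            magic = sb[0x40:0x48]
            generation, = struct.unpack_from("<Q", sb, 0x48)
            devid, = struct.unpack_from("<Q", sb, 0xc9)   # first field of dev_item
            ok = "OK " if (magic == BTRFS_MAGIC and bytenr == off) else "BAD"
            print("%s @ 0x%x: %s gen=%d devid=%d fsid=%s"
                  % (path, off, ok, generation, devid, fsid))

if __name__ == "__main__":
    for dev in sys.argv[1:]:
        dump_supers(dev)

Run against every candidate device at once, it makes it immediately obvious when two devices claim the same fsid and devid, which is exactly the situation that confused the tools in your case.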
> Not knowing that a superblock was a single 4096-byte sector, I had primed my recovery by copying a valid superblock from one drive to the clone of my broken drive before starting the ddrescue of the failing drive. I had hoped that I could piece together a valid superblock from a good drive and whatever I could recover from the failing one. In the end this turned out to be a useful strategy, but meanwhile I had two drives that both claimed to be drive 2 of 4, and no drive claiming to be drive 1 of 4. The tools completely failed to deal with this case and consistently preferred to read the bogus drive 2 instead of the real drive 2, and it wasn't until I deliberately patched over the magic in the cloned drive that I could use the various recovery tools without bizarre and spurious errors. I understand how this was never an anticipated scenario for the recovery process, but if it's happened once, it could happen again. Just dealing with a failing drive and its clone both being available in one system could cause this.

Well, most tools put more focus on not screwing things up further, so it's common that they are not as smart as users would really like.

At least super-recover could make more use of the chunk tree to regenerate the super if the user really wants it.
(Although so far only one case, and that's your case, could make use of this possible new feature.)

> 3) There don't appear to be any tools designed for dumping a full superblock in hex notation, or for patching a superblock in place. Seeing as I was forced to use a hex editor to do exactly that, and then go through hoops to generate a correct CSUM for the patched block, I would certainly have preferred there to be some sort of utility to do the patching for me.

Mostly because we thought the current super-recover was good enough, until your case.
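For the record, regenerating the csum by hand is not that painful. crc32c is currently the only csum type, and as far as I understand the on-disk format the checksum simply covers everything after the 32-byte csum field of the 4096-byte superblock and is stored little-endian in the first 4 bytes of that field. Below is a rough Python sketch of such a patching helper; again this is only my sketch, not an existing btrfs-progs tool, so verify the behaviour against the superblock checking code in btrfs-progs (disk-io.c) before relying on it.

#!/usr/bin/env python3
# Sketch of a superblock re-checksum helper (not an official tool).
# Assumptions: crc32c csum type; the checksum covers bytes 0x20..0x1000 of the
# 4096-byte superblock and is stored little-endian in the first 4 bytes of the
# 32-byte csum field.
import struct
import sys

SUPER_SIZE = 4096
CSUM_SIZE = 32
PRIMARY_SUPER_OFFSET = 64 * 1024   # pass a different offset for the mirror copies

def crc32c(data: bytes) -> int:
    # Plain (slow) CRC-32C: reflected poly 0x82F63B78, init and final XOR 0xFFFFFFFF.
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc >> 1) ^ 0x82F63B78) if crc & 1 else (crc >> 1)
    return crc ^ 0xFFFFFFFF

def restamp(path: str, offset: int = PRIMARY_SUPER_OFFSET) -> None:
    with open(path, "r+b") as f:
        f.seek(offset)
        sb = bytearray(f.read(SUPER_SIZE))
        csum = crc32c(bytes(sb[CSUM_SIZE:]))      # everything after the csum field
        sb[0:4] = struct.pack("<I", csum)         # only the first 4 bytes are compared
        f.seek(offset)
        f.write(sb)
        print("stamped csum 0x%08x at offset 0x%x of %s" % (csum, offset, path))

if __name__ == "__main__":
    restamp(sys.argv[1])

Obviously run it against the ddrescue image first and check the result with dump-super before dd'ing anything back to the real device.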
> 4) Despite having lost all 3 superblocks on one drive in a 4-drive setup (RAID0 data with RAID1 metadata), it was possible to derive all the missing information needed to rebuild the lost superblock from the existing good drives. I don't know how often this can be done, or whether it was due to some peculiarity of the particular RAID configuration I was using, or what. But seeing as this IS possible at least under some circumstances, it would be useful to have some recovery tools that knew what those circumstances were and could make use of them.

In fact, you don't even need any special tool to do the recovery.
A basic ro+degraded mount should allow you to recover 75% of your data, and btrfs-recovery should do pretty much the same.

The biggest advantage you had was your faith in, and knowledge of, the fact that only the superblocks on the device were corrupted, which turned out to be a miracle.
(At the point where I learned that your backup supers were also corrupted, I lost that faith.)

Thanks,
Qu

> 5) Finally, I want to comment on the fact that each drive only stored up to 3 superblocks. Knowing how important they are to system integrity, I would have been happy to have had 5 or 10 such blocks, or to have had each drive keep one copy of each superblock for every other drive. At 4K per superblock, this would seem a trivial amount to store even in a huge RAID with 64 or 128 drives in it. Could there be some method introduced for keeping far more redundant meta-information around? I admit I'm unclear on what the optimal numbers of these things would be. Certainly if I hadn't lost all 3 superblocks at once, I might have thought that number adequate.
>
> Anyway, I hope no one takes these criticisms the wrong way. I'm a huge fan of BTRFS and its potential, and I know it's still early days for the code base and that it has yet to fully mature in its recovery and diagnostic tools. I'm just hoping that these points can contribute in some small way and give back some of the help I got in fixing my system!