On Sat, 12 Mar 2016 20:48:47 +0500
Roman Mamedov <rm@xxxxxxxxxxx> wrote:

> The system was seemingly running just fine for days or weeks, then I
> routinely deleted a bunch of old snapshots, and suddenly got hit with:
>
> [Sat Mar 12 20:17:10 2016] BTRFS error (device dm-0): parent transid verify failed on 7483566862336 wanted 410578 found 404133
> [Sat Mar 12 20:17:10 2016] BTRFS error (device dm-0): parent transid verify failed on 7483566862336 wanted 410578 found 404133

As I mentioned, the initial run of btrfsck --repair did nothing to fix this problem; I then started btrfsck --repair --init-extent-tree, but it had still not finished after 5 days, so I looked for other options.

While reviewing the btrfs-progs source for ways to make btrfsck do something about these transid failures, I spotted a tool called btrfs-corrupt-block. At this point I was ready to accept some loss of data, which I expected to be minor, if user-visible at all (after all, the original backtrace occurs in "btrfs_clean_one_deleted_snapshot", so perhaps all that the "bad" block stored was only related to a snapshot that had already been deleted). I ran:

  /root/btrfs-corrupt-block -l 7483566862336 /dev/nbd8

Btrfsck then finally reported something inspiring some hope:

  checking extents
  checksum verify failed on 7483566862336 found 295F0086 wanted 00000000
  checksum verify failed on 7483566862336 found 295F0086 wanted 00000000
  checksum verify failed on 7483566862336 found 295F0086 wanted 00000000
  checksum verify failed on 7483566862336 found 295F0086 wanted 00000000
  bytenr mismatch, want=7483566862336, have=0
  deleting pointer to block 7483566862336
  ref mismatch on [6504947712 118784] extent item 0, found 1
  adding new data backref on 6504947712 parent 4311306919936 owner 0 offset 0 found 1
  Backref 6504947712 parent 4311306919936 owner 0 offset 0 num_refs 0 not found in extent tree
  Incorrect local backref count on 6504947712 parent 4311306919936 owner 0 offset 0 found 1 wanted 0 back 0x57cfdff0
  backpointer mismatch on [6504947712 118784]
  ...etc

After a few passes it settled into a state with no new errors reported (only a few "bad metadata crossing stripe boundary" messages, but those seem to be commonly reported even on filesystems otherwise exhibiting no issues).

Finally I was able to mount the FS with no backtrace occurring anymore -- the btrfs-cleaner process then finished all the remaining snapshot deletion work, freeing up 20 GB or so. All data seems to be present, and selective checksum verifications showed no corruption. Well, this machine is primarily a backup server using rsync, so it should catch and fix up any losses.

As a side note, for experiments with 'btrfsck --repair', 'btrfs-corrupt-block' and my own patched versions of btrfsck, the technique of making writable CoW snapshots of the whole block device has proved invaluable. At first I used the nbd-server '-c' mode, but quickly discovered it to be flaky: it seems to crash once the amount of changes exceeds 150 MB or so, and in any case its RAM usage appears to be roughly "block device size / 1000", i.e. it used 6 GB of RAM for a 6 TB filesystem. So in the end I switched to the dm-snapshot target, as described in [1].
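Roughly, that looks like the following (a sketch only; the device names, snapshot name and CoW file size here are examples, and the 'N' flag makes the snapshot transient, so all changes are thrown away on teardown):

  # Sparse file to hold the copy-on-write data, attached via loop:
  truncate -s 20G /tmp/cow.img
  losetup /dev/loop0 /tmp/cow.img

  # Writable snapshot of the original device; the dm table format is
  # "<start> <length> snapshot <origin> <CoW dev> <persistent?> <chunksize in 512-byte sectors>":
  dmsetup create fs-snap --table "0 $(blockdev --getsz /dev/sdb) snapshot /dev/sdb /dev/loop0 N 8"

All writes to /dev/mapper/fs-snap then land in the CoW file, leaving the original device untouched; "dmsetup remove fs-snap" and "losetup -d /dev/loop0" discard the experiment.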
One just has to remember never to have the snapshot and the original device both visible (let alone mounted) on the same machine, since the duplicate UUIDs confuse Btrfs. For that, I used the same nbd-server (no longer relying on its built-in CoW), exporting the writable snapshots over the network and mounting them on a different server or VM; see the sketch below.
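Something along these lines should work (again a sketch, assuming nbd 3.x-style named exports; "fs-snap" and "backup-host" are made-up names). On the server, in /etc/nbd-server/config:

  [generic]
  [fs-snap]
      exportname = /dev/mapper/fs-snap

And on the test machine:

  nbd-client backup-host -N fs-snap /dev/nbd8
  mount /dev/nbd8 /mnt/test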
[1] http://stackoverflow.com/questions/7582019/lvm-like-snapshot-on-a-normal-block-device

--
With respect,
Roman
