`btrfs check --repair` stuck in a loop // filesystem repair case study

Hi,

I have a btrfs filesystem on a 4TB HDD connected over USB 2.0. Some time ago I accidentally disconnected the drive during heavy writes. After reconnecting, the filesystem still seemed to work (it mounted fine and I could read some files chosen at random), but I ran `btrfs scrub` to be sure.

`btrfs scrub` aborted itself after ~20 hours, after reading ~3.5TB of data. `dmesg` contained a single line:

#v+
BTRFS error (device dm-0): bad tree block start 0 3527021166592
#v-

I couldn't find any further details anywhere in the logs. I assume this means that some data has actually been lost from this filesystem. I have backups of the data from this drive, so I decided to experiment a little with btrfs recovery strategies.

I checked whether there were any bad blocks on the raw device: all blocks were read successfully.
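
A read-only pass over the whole device could look roughly like the sketch below. The tool actually used for this check is not stated here; `dd` is an assumption, `read_all` is a hypothetical helper name, and `/dev/sdX` is a placeholder for the raw device.

```shell
# Sketch: read every block of a device (or file) and discard the data.
# Without conv=noerror, dd stops at the first read error and exits non-zero,
# so a zero exit status means everything could be read.
# Assumption: dd is used for illustration; /dev/sdX is a placeholder.
read_all() {
    dd if="$1" of=/dev/null bs=1M
}

# Example (needs read access to the device):
#   read_all /dev/sdX && echo "all blocks readable"
```

This only shows that the drive returns data for every block, not that the data is what btrfs expects to find there.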

I created a devicemapper snapshot/overlay to keep the raw device data read only and track the changes made by any recovery procedures.
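
For readers who want to reproduce this kind of overlay, a minimal sketch follows. It assumes the device-mapper `snapshot` target with a sparse file as the copy-on-write store; the exact setup used here is not described, and `snapshot_table`, `overlay.img`, `/dev/sdX`, and the 100G size are all hypothetical.

```shell
# Sketch: build a device-mapper "snapshot" table line.
# Format: <start> <length in 512-byte sectors> snapshot <origin> <COW device> <P|N> <chunk size in sectors>
# P makes the copy-on-write store persistent; 8 sectors = 4 KiB chunks.
snapshot_table() {
    # $1 = origin device, $2 = COW device, $3 = origin size in sectors
    printf '0 %s snapshot %s %s P 8\n' "$3" "$1" "$2"
}

# Hypothetical usage as root (device, file name, and sizes are placeholders):
#   truncate -s 100G overlay.img                  # sparse file that will hold the changes
#   cow=$(losetup -f --show overlay.img)          # expose it as a loop device
#   sectors=$(blockdev --getsz /dev/sdX)          # origin size in 512-byte sectors
#   dmsetup create overlay --table "$(snapshot_table /dev/sdX "$cow" "$sectors")"
# Writes to /dev/mapper/overlay then go to overlay.img; /dev/sdX is left unmodified.
```

With this layout, the allocated size of the sparse file gives a rough measure of how much a recovery tool has changed.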

I ran `btrfstune -u` on the overlay to avoid having two devices with the same uuid. This was done using a dedicated VM which did not see the raw device (suggested by `Ke` on IRC). BTW, this command resulted in the overlay device growing by ~25GB, which IIUC means that around 6M 4096-byte blocks were changed in the process (is that expected?).
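
The block estimate above can be checked with a little arithmetic. The sketch assumes that overlay growth maps one-to-one onto changed 4096-byte blocks, ignoring any overlay metadata; `blocks_changed` is a hypothetical helper name.

```shell
# Sketch: convert overlay growth in bytes into a count of 4096-byte blocks.
blocks_changed() {
    echo $(( $1 / 4096 ))
}

# ~25 GB read as decimal gigabytes (25 * 10^9 bytes)
blocks_changed $(( 25 * 1000 * 1000 * 1000 ))   # prints 6103515

# ~25 GB read as binary gibibytes (25 * 2^30 bytes)
blocks_changed $(( 25 * 1024 * 1024 * 1024 ))   # prints 6553600
```

Either reading gives a little over 6 million blocks, which matches the ~6M figure.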

I was advised to run `btrfs check`. The result is here: [1] (323 lines of output), and IIRC it finished in a few hours.

 [1] https://gist.github.com/liori/f8c5e69677e8c9d6038d2e3e4db9aa42

(The 5 data checksum errors are a preexisting condition; I knew about them before the incident.)

I then started `btrfs check --repair`. This was about a week ago, and it is still going. The partial output is here: [2] (already almost 18k lines). The same problems are being found again and again in a loop, as if it was stuck.

 [2] https://gist.github.com/liori/01494afbe63cd19ba49be663be937d84

I do observe that the ctime of the overlay file is updated every once in a while, but the file itself has not grown any further after an initial change of ~70k blocks. My interpretation is that even if the repair process writes anything, it only keeps writing to the same places again and again.
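
This kind of observation can be made by sampling the overlay file periodically. The sketch below assumes GNU `stat` and a sparse overlay file; `overlay_usage` and `overlay.img` are hypothetical names.

```shell
# Sketch: report the ctime and allocated size of an overlay file.
# GNU stat format: %Z = ctime in seconds since the epoch,
# %b = number of allocated blocks, %B = size in bytes of each such block.
overlay_usage() {
    set -- $(stat -c '%Z %b %B' "$1")
    echo "ctime=$1 allocated_bytes=$(( $2 * $3 ))"
}

# Example: sample once a minute and compare successive lines.
#   while sleep 60; do overlay_usage overlay.img; done
```

A ctime that keeps changing while `allocated_bytes` stays constant is consistent with the repair rewriting blocks that are already present in the overlay.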

I did not have any snapshots on this filesystem. I did have some deduplicated content, but no more than 4 copies of any data block, and deduplication resulted in saving ~1TB of space total. The device was never a part of a multi-device setup.

Is there anything more I can do with this filesystem to bring it to a state where I can `btrfs scrub` it, find out what has been lost, etc.? Is this behavior of `btrfs check --repair` expected, and will it ever finish?

Thank you,


--
Tomasz Melcer