Re: Need help recovering broken RAID5 array (parent transid verify failed)

On Fri, May 15, 2020 at 12:03 AM Emil Heimpel
<broetchenrackete@xxxxxxxxx> wrote:
>
>
> Hi,
>
> I hope this is the right place to ask for help. I am unable to mount my BTRFS array and wanted to know if it is possible to recover (some) data from it.

Hi, yes it is!


>
> I have a RAID1-Metadata/RAID5-Data array consisting of 6 drives: 2x8TB, 5TB, 4TB, and 2x3TB. It was running fine for the last 3 months. Because I expanded it drive by drive, I wanted to do a full balance the other day, when after around 40% completion (ca. 1.5 days) I noticed that one drive was missing from the array (if I remember correctly, it was the 5TB one). I tried to cancel the balance, but even after a few hours it didn't cancel, so I tried to do a reboot. That didn't work either, so I did a hard reset. Probably not the best idea, I know....

The file system should be power-fail safe (with some limited data
loss), but the hardware can betray everything. Your configuration is
better off than it would otherwise be, thanks to the raid1 metadata.
>
> After the reboot all drives appeared again but now I can't mount the array anymore, it gives me the following error in dmesg:
>
> [  858.554594] BTRFS info (device sdc1): disk space caching is enabled
> [  858.554596] BTRFS info (device sdc1): has skinny extents
> [  858.556165] BTRFS error (device sdc1): parent transid verify failed on 23219912048640 wanted 116443 found 116484
> [  858.556516] BTRFS error (device sdc1): parent transid verify failed on 23219912048640 wanted 116443  found 116484
> [  858.556527] BTRFS error (device sdc1): failed to read chunk root
> [  858.588332] BTRFS error (device sdc1): open_ctree failed

The chunk tree is damaged, and it's unexpected that a newer transid
was found than the one wanted. Something happened out of order, and it
affected both copies.

What do you get for:
# btrfs rescue super -v /dev/anydevice
# btrfs insp dump-s -fa /dev/anydevice
# btrfs insp dump-t -b 30122546839552 /dev/anydevice
# mount -o ro,nologreplay,degraded /dev/anydevice /mnt
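
If it helps, the generation and chunk_root fields are the interesting
parts of the super block dump for this failure; a rough shortcut to
pull just those lines out (the grep pattern here is only a suggestion):

# btrfs insp dump-s -f /dev/anydevice | grep -E 'generation|chunk_root'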



>
> [bluemond@BlueQ btrfslogs]$ sudo btrfs check /dev/sdd1

For what it's worth, btrfs check does find all member devices, so you
only have to run check against any one of them. Scrub is different:
you can run it individually per block device, which works around some
performance problems raid56 has when scrub is run against the
volume's mount point.
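
For example (device names here are just placeholders), a per-device
scrub would look like:

# btrfs scrub start -B /dev/sdc1
# btrfs scrub start -B /dev/sdd1

instead of a single scrub against the mount point.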

> And how can I prevent it from happening again? Would using the new multi-parity raid1 for Metadata help?

Difficult to know yet what went wrong. Do you have dmesg/journalctl -k
covering the period from when the drive first dropped out all the way
to the forced power off? It might give a hint. Before doing a forced
poweroff while writes are happening, it can help to first disable the
write cache on all the drives; or alternatively, always leave the
write caches disabled.
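
A minimal sketch of that with hdparm (assuming SATA drives; the device
name is a placeholder):

hdparm -W 0 /dev/sdX

and hdparm -W /dev/sdX with no value just reports the current setting.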

> I'm running arch on an ssd.
> [bluemond@BlueQ btrfslogs]$ uname -a
> Linux BlueQ 5.6.12-arch1-1 #1 SMP PREEMPT Sun, 10 May 2020 10:43:42 +0000 x86_64 GNU/Linux
>
> [bluemond@BlueQ btrfslogs]$ btrfs --version
> btrfs-progs v5.6

5.6.1 is current but I don't think there's anything in the minor
update that applies here.

Post that info and maybe a dev will have time to take a look. If it
does mount ro,degraded, take the chance to update backups, just in
case. Yeah, ~21TB would be really inconvenient to lose. Also, since
it's the weekend and there's some time, it might be useful to have a
btrfs image:

btrfs-image -ss -c9 -t4 /dev/anydevice ~/problemvolume.btrfs.bin

This file will be roughly half the size of the file system's metadata.
I'd guess you could have around 140G of metadata, depending on the
nodesize chosen at mkfs time and how many small files this file system
has.
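
The nodesize, if you want to check it, is reported in the super block
dump from the earlier command; for example (the grep is only a
suggestion):

btrfs insp dump-s /dev/anydevice | grep nodesize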

Still another option that might make it possible to mount, if the
above doesn't work: build the kernel with this patch
https://patchwork.kernel.org/project/linux-btrfs/list/?series=170715

Mount using -o ro,nologreplay,rescue=skipbg

This doesn't actually fix the problem either; it just might make it
possible to mount the file system, mainly for updating backups in case
the damage turns out not to be fixable.
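
If it does come up read-only with one of these options, a minimal
sketch of pulling data off (mount point and paths are placeholders):

mount -o ro,nologreplay,rescue=skipbg /dev/anydevice /mnt
rsync -aHAX /mnt/ /path/to/backup/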


--
Chris Murphy



