Re: How to heal this btrfs fi corruption?

On 2019/12/20 4:00 AM, Ralf Zerres wrote:
> Dear list,
> 
> at a customer site I can't mount a given btrfs device in rw mode.
> This is production data; I do have a backup, and I managed to mount the filesystem in ro mode and copy out the relevant data.
> Having said this, if btrfs check --repair can't heal the situation, I could reformat the filesystem and start all over.
> But I would prefer to save the time and take the healing as proof of the "production ready" status of btrfs-progs.
> 
> Here are the details:
> 
> kernel: 5.2.2 (Ubuntu 18.04.3)
> btrfs-progs: 5.2.1
> HBA: DELL Perc
> # storcli /c0/v0
> # 0/0   RAID5 Optl  RW     Yes     RWBD  -   OFF 7.274 TB SSD-Data
> #btrfs fi show /dev/sdX
> #Label: 'Data-Ssd'  uuid: <my uuid>
> #        Total devices 1 FS bytes used 7.12TiB
> #        devid    1 size 7.27TiB used 7.27TiB path /dev/<mydev>
> 
> What happened:
> The customer filled up the filesystem (lots of snapshots in a couple of subvolumes).
> The system had been running kernel 4.15 and btrfs-progs 4.15. I updated the kernel and btrfs-progs
> on the assumption that more current tools could do a better job, since they have seen lots of fixes.
> 
> 1) As a first step, I ran
> 
> # btrfs check --mode lowmem --progress /dev/<mydev>

The initial report would help a lot in determining the root cause of
the corruption in the first place.

But if btrfs check (in both modes) reports errors, you'd better not
assume --repair can do a better job.

Currently btrfs check is only good at finding problems, not really
fixing them.

There are simply too many things to consider when doing a repair, so
--repair is still far from "production ready".
That's why in the v5.4 progs we added an extra wait time for --repair.
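For reference, here is how one could capture the complete check report
read-only, so the very first error is preserved for the list (the
device path is a placeholder; --readonly is the default mode and never
writes to the device):

```shell
# Run btrfs check WITHOUT --repair: it only reads the filesystem.
# Capture stderr too, since the error lines go there.
btrfs check --readonly /dev/<mydev> 2>&1 | tee check-original.log

# The lowmem mode walks the trees differently and may report
# different issues, so a second log is worth attaching as well.
btrfs check --mode lowmem /dev/<mydev> 2>&1 | tee check-lowmem.log
```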

> 
> got extent mismatches and wrong extent CRCs
> 
> 2) As a second step, I tried to mount in recovery mode
> 
> # mount -t btrfs -o defaults,recovery,skip_balance /dev/<mydev> /mnt
> 
> I included skip_balance, since there might be an unfinished balance run. But this didn't work out.

The dmesg output would help in finding out what went wrong.

Just a tip for such reports: the initial error message is always the
most important thing.
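As a sketch, the kernel log could be collected right after the failed
mount attempt like this (the log file name is just a suggestion; note
that on 5.x kernels the old "recovery" mount option has been replaced
by "usebackuproot"):

```shell
# Reproduce the failing mount, then save the btrfs kernel messages.
mount -t btrfs -o usebackuproot,skip_balance /dev/<mydev> /mnt
# -T prints human-readable timestamps; grep narrows to btrfs lines.
dmesg -T | grep -i btrfs > btrfs-mount-failure.log
```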

> 
> 3) As a third step, I got it mounted in ro mode
> 
> # mount -t btrfs -o ro /dev/<mydev> /mnt
> 
> and collected the usage data:
> 
> # btrfs fi usage /mnt
> # Overall:
> #    Device size:                   7.27TiB
> #    Device allocated:              7.27TiB
> #    Device unallocated:            1.00MiB
> #    Device missing:                  0.00B
> #    Used:                          7.13TiB
> #    Free (estimated):            134.13GiB      (min: 134.13GiB)
> #    Data ratio:                       1.00
> #    Metadata ratio:                   2.00
> #    Global reserve:              512.00MiB      (used: 0.00B)
> #
> # Data,single: Size:7.23TiB, Used:7.10TiB
> #   /dev/<mydev>        7.23TiB
> #
> # Metadata,DUP: Size:21.50GiB, Used:14.31GiB
> #   /dev/<mydev>       43.00GiB
> #
> # System,DUP: Size:8.00MiB, Used:864.00KiB
> #   /dev/<mydev>       16.00MiB
> 
> # Unallocated:
> #   /dev/<mydev>        1.00MiB
> 
> Obviously, totally filled up.
> At that point I copied out all relevant data - you never know ... Finished!
> 
> Then I tried to unmount, but that went nowhere and led to a reboot.
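(Side note: when a btrfs is fully allocated like this, the usual way
to get working space back - assuming the filesystem mounts read-write
and is otherwise healthy, which is not the case here - is to drop old
snapshots and compact half-empty data chunks with balance. A sketch
only; the snapshot path is a placeholder:

```shell
# Delete snapshots that are no longer needed to release extents.
btrfs subvolume delete /mnt/snapshots/<old-snapshot>

# Rewrite data chunks below the given usage threshold, returning
# them to unallocated space. Start low and raise the filter.
btrfs balance start -dusage=10 /mnt
btrfs balance start -dusage=25 /mnt
```
)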
> 
> 
> 4) As a fourth step, I tried to repair it
> 
> # btrfs check --mode lowmem --progress --repair /dev/<mydev>
> # enabling repair mode
> # WARNING: low-memory mode repair support is only partial
> # Opening filesystem to check...
> # Checking filesystem on /dev/<mydev>
> # UUID: <my UUID>
> # [1/7] checking root items                      (0:00:33 elapsed, 20853512 items checked)
> # Fixed 0 roots.
> # ERROR: extent[1988733435904, 134217728] referencer count mismatch (root: 261, owner: 286, offset: 5905580032) wanted: # 28, have: 34
> # ERROR: fail to allocate new chunk No space left on device
> # Try to exclude all metadata blcoks and extents, it may be slow
> # Delete backref in extent [1988733435904 134217728]07:16 elapsed, 40435 items checked)
> # ERROR: extent[1988733435904, 134217728] referencer count mismatch (root: 261, owner: 286, offset: 5905580032) wanted: 27, have: 34
> # Delete backref in extent [1988733435904 134217728]
> # ERROR: extent[1988733435904, 134217728] referencer count mismatch (root: 261, owner: 286, offset: 5905580032) wanted: 26, have: 34
> # ERROR: commit_root already set when starting transaction
> # ERROR: fail to start transaction: Invalid argument
> # ERROR: extent[2017321811968, 134217728] referencer count mismatch (root: 261, owner: 287, offset: 2281701376) wanted: 3215, have: 3319
> # ERROR: commit_root already set when starting transaction
> # ERROR: fail to start transaction Invalid argument
> 
> This ended with a core dump.
> 
> Last but not least, my question:
> 
> I'm not experienced enough to solve this issue myself and need your help.
> Is it worth the time and effort to solve this issue?

I don't think it would be worth it, unless you're a really kind guy
who wants to make btrfs-progs better.
The time to repair the image could easily exceed the time to just
restore the backup, and even then the repair isn't guaranteed to
succeed.

> Developers might be interested, since this is a real-life testbed?
> Do you need any further info that would help to solve the issue?

In this case, the history of the corruption would be more useful.

But since it's a 4.15 kernel, which may not have enough fixes
backported (it's an Ubuntu kernel, not a SUSE one), and since 5.2.2 is
not safe at all (you need 5.3.0 or 5.2.15), we can't even determine
whether 5.2.2 caused the corruption in the first place.

So I'm not sure we can get more juice out of this report.
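If any developer does want to dig in, the usual way to share a
corrupted filesystem is a metadata-only dump; btrfs-image copies no
file data, and -s additionally sanitizes file names for privacy. The
output path below is just an example:

```shell
# Create a compressed, sanitized metadata dump for bug reports.
# -c9: maximum zlib compression, -t4: use four worker threads,
# -s: overwrite file names to keep private data out of the dump.
btrfs-image -c9 -t4 -s /dev/<mydev> /tmp/btrfs-metadata.img
```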

Thanks,
Qu

> 
> 
> Best regards
> Ralf

