On Friday, 20.12.2019, 14:05 +0800, Qu Wenruo wrote:
> 
> On 2019/12/20 4:00 AM, Ralf Zerres wrote:
> > Dear list,
> > 
> > at customer site I can't mount a given btrfs device in rw mode.
> > This is production data; I do have a backup and managed to mount the filesystem in ro mode, so I did copy out the relevant stuff.
> > Having said this, if btrfs check --repair can't heal the situation, I could reformat the filesystem and start all over.
> > But I would prefer to save the time and take the healing as proof of the "production ready" status of btrfs-progs.
> > 
> > Here are the details:
> > 
> > kernel: 5.2.2 (Ubuntu 18.04.3)
> > btrfs-progs: 5.2.1
> > HBA: DELL Perc
> > # storcli /c0/v0
> > # 0/0 RAID5 Optl RW Yes RWBD - OFF 7.274 TB SSD-Data
> > 
> > # btrfs fi show /dev/sdX
> > # Label: 'Data-Ssd'  uuid: <my uuid>
> > #   Total devices 1  FS bytes used 7.12TiB
> > #   devid 1  size 7.27TiB  used 7.27TiB  path /dev/<mydev>
> > 
> > What happened:
> > The customer filled up the filesystem (lots of snapshots in a couple of subvolumes).
> > The system was working with kernel 4.15 and btrfs-progs 4.15. I updated kernel and btrfs-progs
> > on the assumption that more recent mainline tools could do a better job, since they have seen lots of fixups.
> > 
> > 1) As a first step, I ran
> > 
> > # btrfs check --mode lowmem --progress /dev/<mydev>
> 
> The initial report would help a lot to determine the root cause of the
> corruption in the first place.
> 
> But if btrfs check (both modes) reports errors, you'd better not think
> --repair can do a better job.
> 
> Currently btrfs check is only good at finding problems, not really
> fixing them.
> 

Thanks for this clarification.

> As there are too many things to consider when doing repair, --repair is
> at least far from "production ready".
> That's why in the v5.4 progs we added an extra wait time for --repair.
> 

Which means we have to wait until development has finished this task. Until then I will
regard --repair as a WIP function that may or may not help: only use it on data sets for
which valid backups exist, or be prepared to lose data.

> > 
> > I got extent mismatches and wrong extent CRCs.
> > 
> > 2) As a second step I tried to mount in recovery mode
> > 
> > # mount -t btrfs -o defaults,recovery,skip_balance /dev/<mydev> /mnt
> > 
> > I included skip_balance, since there might be an unfinished balance run. But this didn't work out.
> 
> The dmesg would help to find out what went wrong.
> 
> Just a tip for such a report: the initial error message is always the
> most important thing.
> 
> > 
> > 3) As a third step, I got it mounted in ro mode
> > 
> > # mount -t btrfs -o ro /dev/<mydev> /mnt
> > 
> > and gathered the following data via usage:
> > 
> > # btrfs fi usage /mnt
> > # Overall:
> > #     Device size:           7.27TiB
> > #     Device allocated:      7.27TiB
> > #     Device unallocated:    1.00MiB
> > #     Device missing:        0.00B
> > #     Used:                  7.13TiB
> > #     Free (estimated):    134.13GiB  (min: 134.13GiB)
> > #     Data ratio:               1.00
> > #     Metadata ratio:           2.00
> > #     Global reserve:      512.00MiB  (used: 0.00B)
> > #
> > # Data,single: Size:7.23TiB, Used:7.10TiB
> > #     /dev/<mydev>    7.23TiB
> > #
> > # Metadata,DUP: Size:21.50GiB, Used:14.31GiB
> > #     /dev/<mydev>   43.00GiB
> > #
> > # System,DUP: Size:8.00MiB, Used:864.00KiB
> > #     /dev/<mydev>   16.00MiB
> > 
> > # Unallocated:
> > #     /dev/<mydev>    1.00MiB
> > 
> > Obviously, totally filled up.
> > At that point I copied out all relevant data - you never know ... Finished!
> > 
> > Then I tried to unmount, but that got nowhere and led to a reboot.
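
Side note: I did not save the kernel messages from the failed recovery mount (point 2) or from
the hanging unmount. For the next attempt I will grab them right away; a rough sketch, assuming
journalctl with persistent journaling is available on this Ubuntu 18.04 box:

# right after a failed mount attempt, in the current boot:
# dmesg | grep -i btrfs | tail -n 80
# or, after a reboot, from the previous boot's journal:
# journalctl -k -b -1 | grep -i btrfs

The first BTRFS error line is what I will look for, per your tip above.
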
> > 
> > 4) As a fourth step, I tried to repair it
> > 
> > # btrfs check --mode lowmem --progress --repair /dev/<mydev>
> > # enabling repair mode
> > # WARNING: low-memory mode repair support is only partial
> > # Opening filesystem to check...
> > # Checking filesystem on /dev/<mydev>
> > # UUID: <my UUID>
> > # [1/7] checking root items (0:00:33 elapsed, 20853512 items checked)
> > # Fixed 0 roots.
> > # ERROR: extent[1988733435904, 134217728] referencer count mismatch (root: 261, owner: 286, offset: 5905580032) wanted: 28, have: 34
> > # ERROR: fail to allocate new chunk No space left on device
> > # Try to exclude all metadata blcoks and extents, it may be slow
> > # Delete backref in extent [1988733435904 134217728]07:16 elapsed, 40435 items checked)
> > # ERROR: extent[1988733435904, 134217728] referencer count mismatch (root: 261, owner: 286, offset: 5905580032) wanted: 27, have: 34
> > # Delete backref in extent [1988733435904 134217728]
> > # ERROR: extent[1988733435904, 134217728] referencer count mismatch (root: 261, owner: 286, offset: 5905580032) wanted: 26, have: 34
> > # ERROR: commit_root already set when starting transaction
> > # ERROR: fail to start transaction: Invalid argument
> > # ERROR: extent[2017321811968, 134217728] referencer count mismatch (root: 261, owner: 287, offset: 2281701376) wanted: 3215, have: 3319
> > # ERROR: commit_root already set when starting transaction
> > # ERROR: fail to start transaction Invalid argument
> > 
> > This ends with a core dump.
> > 
> > Last but not least, my question:
> > 
> > I'm not experienced enough to solve this issue myself and need your help.
> > Is it worth the time and effort to solve this issue?
> 
> I don't think it would be worth it, unless you're a really super kind
> guy who wants to make btrfs-progs better.
> The time to repair the image could easily be more than just restoring
> the backup, not to mention it's not guaranteed to save it.
> 

I will give btrfs-progs 5.4 a run on a system booted with a 5.4 kernel. The ssd-pool is still
available in its corrupted state, and it will not go into production anyway before the capacity
can be extended. The disks are ordered and on their way. I will just do the --repair as an
academic exercise (not that I'm calling myself a super kind guy), but it might give some insight.

> > Developers might be interested in having a real-life testbed?
> > Do you need any further info that would help to solve the issue?
> 
> In this case, the history of the corruption would be more useful.
> 
> But since it's a 4.15 kernel which may not have enough fixes backported
> (since it's Ubuntu, not a SUSE kernel), and 5.2.2 is not safe at all
> (you need 5.3.0 or 5.2.15), we can't even determine if it's 5.2.2 that
> caused the corruption in the first place.

Well, I do expect 5.4.0 to be equally valid. Too bad that there is no official backport for
Ubuntu stable (aka 18.04.x).

> So I'm not sure if we can get more juice from the report.
> 

When I add the new disks to the RAID5, I will definitely create a fresh btrfs filesystem to be
sure it is clean and has no faults. Then the subvols and data will be restored with
btrfs send/receive (a rough sketch follows at the end of this mail).

> Thanks,
> Qu
> 

Qu, thanks a bunch for your time and the fruitful information.

Ralf

> > 
> > Best regards
> > Ralf
> > 
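
P.S. For reference, the planned restore onto the fresh filesystem would look roughly like this,
per subvolume (a sketch only: the paths are placeholders, and btrfs send requires the source
snapshot on the backup side to be read-only):

# btrfs subvolume snapshot -r /backup/<subvol> /backup/<subvol>.ro
# btrfs send /backup/<subvol>.ro | btrfs receive /mnt/<newfs>/

Later incremental updates could then use btrfs send -p <parent-snapshot>.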
