Hi.
My primary (boot) filesystem is broken, due to an interrupted resize operation.
I'm hoping that I can get help either fixing the filesystem or
recovering some of my data, but I'd also like to know why btrfs and
its tools acted the way they did.
I think I've also found a bug in GParted.
Output of uname -a on the recovery medium: Linux ArchUSB
4.18.3-arch1-1-ARCH #1 SMP PREEMPT Sat Aug 18 09:22:54 UTC 2018 x86_64
GNU/Linux
Kernel on the affected system: Liquorix Linux 4.17.14
Output of btrfs --version: btrfs-progs v4.17.1
Relevant dmesg log is attached.
I currently use a single btrfs filesystem (fs A) with subvolumes for
root, home, and var. It's ~140 GiB in size, with ~130GiB used.
I also have a second btrfs filesystem (fs B) from a previous
installation of Arch Linux. It's ~40 GiB in size, with less than 30
GiB used.
First of all, how I got into this situation:
Yesterday, I wanted to reclaim some space, so I decided to shrink fs B.
I opened up GParted, and resized it. I got an error from btrfs, like
"no space left on device".
I was a little confused. This was seemingly solved by unmounting fs B.
I re-ran the reisze. Some process was using 100% CPU on one core, and
I didn't see much (if any) I/O activity.
After a few minutes I noticed that GParted was attempting to resize
the mount point of fs B ('/mnt/oldsys' on fs A), even though it wasn't
mounted!
I'm not sure what this does, but I figure that it's not good (even
though fs A shouldn't fit in 32 GiB).
So I press cancel... nothing. I try force cancel, still no effect. I
try to kill the resize process, first with SIGTERM, then with
SIGKILL... nothing.
I figure that I have to reboot at this point. During reboot, systemd
waits really long for a stop job on some user session thing.
Then, strangely, I see output _during shutdown_ about btrfs
_beginning_ to resize fs A (referring to it by its block device,
/dev/sda2)...
I choose to reset my computer. When I try to boot again (bad idea in
retrospect) systemd takes a long time to "re-mount the root
filesystem".
Start jobs begin to timeout and fail, so I reset my computer again to
boot into a recovery medium.
Part 2, the attempted recovery:
I run btrfsck on sda2, and seeing lots of errors (see
btrfsck.readonly.log) I choose to make an image of the entire block
device.
After that, I attempt a btrfsck --repair (btrfsck.repair.log).
It seems alright until it reaches "Deleting bad dir index" and then
hangs. IIRC at some point it segfaulted.
Desperately, I try to run another repair... and I encounter a BUG_ON
(btrfsck.repair.2.log). Ouch.
Well, at this point I'm stuck. I have no backups.
I've already restored the image of fs A from before any repair attempts.
Thanks in advance,
Jared
Some big files that I couldn't attach:
btrfsck.readonly.log:
https://drive.google.com/open?id=1CZP67uCs7zCyi1CfPv6tt9DxnIfU967u
btrfsck.repair.log:
https://drive.google.com/open?id=1l2Nj8n9CzmxRZznbbIYEMrQc6c5plMDm
P.S. Sorry if this gets sent twice -- Gmail failed to deliver it the first time.
[1/7] checking root items
Fixed 0 roots.
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
Deleting bad dir index [258,96,2] root 18446744073709551608
extent-tree.c:1423: btrfs_inc_extent_ref: BUG_ON `err` triggered, value -5
btrfs check(+0x1e612)[0x5593cc313612]
btrfs check(btrfs_inc_extent_ref+0x166)[0x5593cc3167b6]
btrfs check(+0x1f064)[0x5593cc314064]
btrfs check(__btrfs_cow_block+0x5e5)[0x5593cc309af5]
btrfs check(btrfs_cow_block+0xf9)[0x5593cc309ea9]
btrfs check(btrfs_search_slot+0x36d)[0x5593cc30cd1d]
btrfs check(btrfs_lookup_dir_index+0x58)[0x5593cc31ea58]
btrfs check(+0x5d6ae)[0x5593cc3526ae]
btrfs check(cmd_check+0x3580)[0x5593cc35d460]
btrfs check(main+0x88)[0x5593cc3080b8]
/usr/lib/libc.so.6(__libc_start_main+0xf3)[0x7f95fa1b2223]
btrfs check(_start+0x2e)[0x5593cc3081de]
[ 2137.123004] BTRFS info (device sda2): disk space caching is enabled
[ 2137.123005] BTRFS info (device sda2): has skinny extents
[ 2137.189323] BTRFS info (device sda2): enabling ssd optimizations
[ 3409.093491] traps: btrfsck[962] general protection ip:5572b1e0c9c2 sp:7fffbc926c80 error:0 in btrfs[5572b1dfe000+7a000]
[ 3409.093520] audit: type=1701 audit(1534813029.734:4): auid=1000 uid=0 gid=0 ses=1 pid=962 comm="btrfsck" exe="/usr/bin/btrfs" sig=11 res=1
[ 3563.069359] INFO: task btrfsck:962 blocked for more than 120 seconds.
[ 3563.069814] Not tainted 4.18.3-arch1-1-ARCH #1
[ 3563.070239] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3563.070666] btrfsck D 0 962 960 0x80000080
[ 3563.070671] Call Trace:
[ 3563.070682] ? __schedule+0x29b/0x8b0
[ 3563.070686] schedule+0x32/0x90
[ 3563.070690] schedule_preempt_disabled+0x14/0x20
[ 3563.070695] __mutex_lock.isra.0+0x220/0x530
[ 3563.070700] ? file_update_time+0x5e/0x130
[ 3563.070705] pipe_write+0x36/0x3f0
[ 3563.070710] ? _raw_spin_unlock+0x16/0x30
[ 3563.070714] ? follow_page_pte+0x3a7/0x5f0
[ 3563.070720] __vfs_write+0x13f/0x190
[ 3563.070725] __kernel_write+0x4d/0xf0
[ 3563.070730] dump_emit+0x91/0xc0
[ 3563.070737] elf_core_dump+0x130b/0x14a0
[ 3563.070747] do_coredump+0x650/0xe98
[ 3563.070757] get_signal+0x29b/0x5d0
[ 3563.070761] ? general_protection+0x8/0x30
[ 3563.070769] do_signal+0x36/0x650
[ 3563.070773] ? _raw_spin_lock+0x13/0x40
[ 3563.070777] ? _raw_spin_unlock+0x16/0x30
[ 3563.070781] ? _raw_spin_lock_irqsave+0x25/0x50
[ 3563.070785] ? _raw_spin_unlock_irqrestore+0x20/0x40
[ 3563.070790] ? try_to_wake_up+0x231/0x480
[ 3563.070795] ? general_protection+0x8/0x30
[ 3563.070801] exit_to_usermode_loop+0xba/0xe0
[ 3563.070805] prepare_exit_to_usermode+0x64/0x90
[ 3563.070810] retint_user+0x8/0x8
[ 3563.070815] RIP: 0033:0x5572b1e0c9c2
[ 3563.070816] Code: Bad RIP value.
[ 3563.070826] RSP: 002b:00007fffbc926c80 EFLAGS: 00010202
[ 3563.070829] RAX: 0000d572746076df RBX: 000055736f0ec2c0 RCX: 0000000000000009
[ 3563.070831] RDX: 000000000000002a RSI: 0000000000000009 RDI: 0000000000000000
[ 3563.070833] RBP: 0000000000500000 R08: 00000000fffffffe R09: 00005572b7ce0990
[ 3563.070835] R10: 0000000000000000 R11: 00005572b239eef0 R12: 00005572bbf2d7d0
[ 3563.070837] R13: 00007fffbc926ccf R14: 00000000000000b6 R15: 00005572b7ce0a10
[ 3563.070842] INFO: task systemd-coredum:1020 blocked for more than 120 seconds.
[ 3563.071272] Not tainted 4.18.3-arch1-1-ARCH #1
[ 3563.071701] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3563.072134] systemd-coredum D 0 1020 1 0x000001a0
[ 3563.072137] Call Trace:
[ 3563.072142] ? __schedule+0x29b/0x8b0
[ 3563.072147] ? preempt_count_add+0x68/0xa0
[ 3563.072150] schedule+0x32/0x90
[ 3563.072156] wb_wait_for_completion+0x59/0x90
[ 3563.072161] ? wait_woken+0x80/0x80
[ 3563.072166] __writeback_inodes_sb_nr+0xaa/0xd0
[ 3563.072172] try_to_writeback_inodes_sb+0x3c/0x50
[ 3563.072208] ext4_nonda_switch+0x7b/0x80 [ext4]
[ 3563.072231] ext4_da_write_begin+0x6f/0x480 [ext4]
[ 3563.072239] generic_perform_write+0xb7/0x1c0
[ 3563.072242] ? file_update_time+0x5e/0x130
[ 3563.072246] __generic_file_write_iter+0xfa/0x1c0
[ 3563.072250] ? mem_cgroup_event_ratelimit.isra.11+0x36/0xa0
[ 3563.072269] ext4_file_write_iter+0xc6/0x3e0 [ext4]
[ 3563.072275] do_iter_readv_writev+0x145/0x1a0
[ 3563.072279] do_iter_write+0x80/0x190
[ 3563.072283] iter_file_splice_write+0x255/0x3b0
[ 3563.072289] __se_sys_splice+0x31b/0x770
[ 3563.072295] do_syscall_64+0x5b/0x170
[ 3563.072300] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3563.072303] RIP: 0033:0x7f1de9d12f4b
[ 3563.072304] Code: Bad RIP value.
[ 3563.072310] RSP: 002b:00007ffc41aeb798 EFLAGS: 00000246 ORIG_RAX: 0000000000000113
[ 3563.072314] RAX: ffffffffffffffda RBX: 00000000267ed000 RCX: 00007f1de9d12f4b
[ 3563.072315] RDX: 0000000000000006 RSI: 0000000000000000 RDI: 0000000000000005
[ 3563.072317] RBP: 00007ffc41aeb820 R08: 00000000267ed000 R09: 0000000000000000
[ 3563.072319] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000267ed000
[ 3563.072320] R13: 0000000000002000 R14: 0000000000000000 R15: 0000000000000000
[ 3808.825151] INFO: task systemd-coredum:1020 blocked for more than 120 seconds.
[ 3808.825609] Not tainted 4.18.3-arch1-1-ARCH #1
[ 3808.826039] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3808.826474] systemd-coredum D 0 1020 1 0x000001a4
[ 3808.826479] Call Trace:
[ 3808.826490] ? __schedule+0x29b/0x8b0
[ 3808.826495] ? preempt_count_add+0x68/0xa0
[ 3808.826499] schedule+0x32/0x90
[ 3808.826507] wb_wait_for_completion+0x59/0x90
[ 3808.826512] ? wait_woken+0x80/0x80
[ 3808.826518] __writeback_inodes_sb_nr+0xaa/0xd0
[ 3808.826523] try_to_writeback_inodes_sb+0x3c/0x50
[ 3808.826562] ext4_nonda_switch+0x7b/0x80 [ext4]
[ 3808.826585] ext4_da_write_begin+0x6f/0x480 [ext4]
[ 3808.826593] generic_perform_write+0xb7/0x1c0
[ 3808.826598] __generic_file_write_iter+0xfa/0x1c0
[ 3808.826618] ext4_file_write_iter+0xc6/0x3e0 [ext4]
[ 3808.826622] ? __seccomp_filter+0x43/0x490
[ 3808.826628] __vfs_write+0x13f/0x190
[ 3808.826634] vfs_write+0xa9/0x190
[ 3808.826638] ksys_write+0x4f/0xb0
[ 3808.826644] do_syscall_64+0x5b/0x170
[ 3808.826649] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3808.826653] RIP: 0033:0x7f1de9dec0b5
[ 3808.826654] Code: Bad RIP value.
[ 3808.826664] RSP: 002b:00007ffc41aeb6d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 3808.826668] RAX: ffffffffffffffda RBX: 0000000000032bf9 RCX: 00007f1de9dec0b5
[ 3808.826670] RDX: 0000000000032bf9 RSI: 00007f1de5a0a010 RDI: 0000000000000007
[ 3808.826672] RBP: 00007f1de5a0a010 R08: 00007f1d659a9000 R09: 00007f1da4b28ff4
[ 3808.826673] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000007
[ 3808.826675] R13: 0000000000000000 R14: ffffffffffffffff R15: 000000003f180000
[ 3931.703090] INFO: task systemd-coredum:1020 blocked for more than 120 seconds.
[ 3931.703551] Not tainted 4.18.3-arch1-1-ARCH #1
[ 3931.703989] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3931.704429] systemd-coredum D 0 1020 1 0x000001a4
[ 3931.704434] Call Trace:
[ 3931.704445] ? __schedule+0x29b/0x8b0
[ 3931.704450] ? preempt_count_add+0x68/0xa0
[ 3931.704454] schedule+0x32/0x90
[ 3931.704460] wb_wait_for_completion+0x59/0x90
[ 3931.704465] ? wait_woken+0x80/0x80
[ 3931.704470] __writeback_inodes_sb_nr+0xaa/0xd0
[ 3931.704476] try_to_writeback_inodes_sb+0x3c/0x50
[ 3931.704514] ext4_nonda_switch+0x7b/0x80 [ext4]
[ 3931.704537] ext4_da_write_begin+0x6f/0x480 [ext4]
[ 3931.704545] generic_perform_write+0xb7/0x1c0
[ 3931.704550] __generic_file_write_iter+0xfa/0x1c0
[ 3931.704569] ext4_file_write_iter+0xc6/0x3e0 [ext4]
[ 3931.704574] ? __seccomp_filter+0x43/0x490
[ 3931.704580] __vfs_write+0x13f/0x190
[ 3931.704586] vfs_write+0xa9/0x190
[ 3931.704590] ksys_write+0x4f/0xb0
[ 3931.704596] do_syscall_64+0x5b/0x170
[ 3931.704602] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3931.704606] RIP: 0033:0x7f1de9dec0b5
[ 3931.704607] Code: Bad RIP value.
[ 3931.704617] RSP: 002b:00007ffc41aeb6d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 3931.704621] RAX: ffffffffffffffda RBX: 0000000000032bf9 RCX: 00007f1de9dec0b5
[ 3931.704622] RDX: 0000000000032bf9 RSI: 00007f1de5a0a010 RDI: 0000000000000007
[ 3931.704624] RBP: 00007f1de5a0a010 R08: 00007f1d659a9000 R09: 00007f1da4b28ff4
[ 3931.704626] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000007
[ 3931.704628] R13: 0000000000000000 R14: ffffffffffffffff R15: 000000003f180000
[ 3982.718116] audit: type=1701 audit(1534813603.367:5): auid=1000 uid=0 gid=0 ses=1 pid=960 comm="sudo" exe="/usr/bin/sudo" sig=11 res=1