On 19/10/2019 21:29, Peter Hjalmarsson wrote:
Hi Peter,
Thanks for the report.
> While trying to reproduce another problem I have seen with BTRFS while
> running balance and raid6 I hit an issue resulting in:
> BUG: kernel NULL pointer dereference, address: 00000000000002ce
>
> I created a script trying to pinpoint the problem utilizing zram, and
> the run goes like this:
>
> 1. run the scripts which sets up a "SINGEL" filesystem on two larger
> devices, fills it with an adequate amount of data, and then tries to
> convert it to raid6
> 2. the filesystem will fail to convert due to not enough space to
> convert all data to raid6 (not enough space on at least 3 devices at
> the same time), hitting an enospc-error
> * Up until this point stuff still seems to work without crashing, and
> the system seems stable, just with two different profiles fot the data
> 3. issue the umount command which will be "Killed", and a backtrace
> will be written to dmesg
>
> This results in a filesystem that cannot be synced, unmounted, and in
> all just seems crashed.
> I have ran the script 6 times. 1 time it passed, 5 times it has
> crashed, and I have rebooted the computer, since that is the only way
> it seems to get rid of the filesysem, so the issue seems pretty
> reproducable.
>
> The system is a laptop running Fedora 31 Beta with the latest updates,
> and the latest kernel (5.3.6-300.fc31.x86_64)
>
> Do you have any input, or any other questions or stuff you want me to test?
>
> The script looks as follow:
> -----
> $ cat run-btrfs-test
> modprobe -iv zram num_devices=8
> udevadm trigger
> sync
> zramctl /dev/zram0 -s 8G && \
> zramctl /dev/zram1 -s 8G && \
> zramctl /dev/zram2 -s 4G && \
> zramctl /dev/zram3 -s 4G && \
> zramctl /dev/zram4 -s 4G && \
> zramctl /dev/zram5 -s 2G && \
> zramctl /dev/zram6 -s 2G && \
> zramctl /dev/zram7 -s 4G && \
> mkfs.btrfs /dev/zram0 && \
> mkdir -p /mnt/btrfs-test && \
> mount /dev/zram0 /mnt/btrfs-test && \
> echo "FS Mounted" && \
> btrfs dev add /dev/zram1 /mnt/btrfs-test && \
> echo "Devices added" && \
> for int in {1..500} ; do dd if=/dev/zero of=/mnt/btrfs-test/file${int}
> bs=32M count=1 && sync ; done
> btrfs dev add /dev/zram[2-7] /mnt/btrfs-test && \
> btrfs fi sh /mnt/btrfs-test && \
> btrfs fi df /mnt/btrfs-test && \
> btrfs bal star -mconvert=raid6 /mnt/btrfs-test && \
> btrfs bal star -dconvert=raid6 /mnt/btrfs-test
> btrfs fi sh /mnt/btrfs-test && \
> btrfs fi df /mnt/btrfs-test
> =====
>
> The output from the script, I will trim the output down to after the for-loop:
> ------
> Label: none uuid: 87150918-e487-4f59-994b-ccb13ee05446
> Total devices 8 FS bytes used 15.64GiB
> devid 1 size 8.00GiB used 8.00GiB path /dev/zram0
> devid 2 size 8.00GiB used 8.00GiB path /dev/zram1
> devid 3 size 4.00GiB used 0.00B path /dev/zram2
> devid 4 size 4.00GiB used 0.00B path /dev/zram3
> devid 5 size 4.00GiB used 0.00B path /dev/zram4
> devid 6 size 2.00GiB used 0.00B path /dev/zram5
> devid 7 size 2.00GiB used 0.00B path /dev/zram6
> devid 8 size 4.00GiB used 0.00B path /dev/zram7
>
> Data, single: total=15.74GiB, used=15.63GiB
> System, single: total=4.00MiB, used=16.00KiB
> Metadata, single: total=264.00MiB, used=17.06MiB
> GlobalReserve, single: total=16.70MiB, used=0.00B
> Done, had to relocate 3 out of 20 chunks
> ERROR: error during balancing '/mnt/btrfs-test': No space left on device
> There may be more info in syslog - try dmesg | tail
> Label: none uuid: 87150918-e487-4f59-994b-ccb13ee05446
> Total devices 8 FS bytes used 15.65GiB
> devid 1 size 8.00GiB used 2.99GiB path /dev/zram0
> devid 2 size 8.00GiB used 2.00GiB path /dev/zram1
> devid 3 size 4.00GiB used 4.00GiB path /dev/zram2
> devid 4 size 4.00GiB used 4.00GiB path /dev/zram3
> devid 5 size 4.00GiB used 4.00GiB path /dev/zram4
> devid 6 size 2.00GiB used 2.00GiB path /dev/zram5
> devid 7 size 2.00GiB used 2.00GiB path /dev/zram6
> devid 8 size 4.00GiB used 4.00GiB path /dev/zram7
>
> Data, single: total=2.00GiB, used=1.97GiB
> Data, RAID6: total=14.41GiB, used=13.66GiB
> System, RAID6: total=80.00MiB, used=16.00KiB
> Metadata, RAID6: total=512.00MiB, used=17.52MiB
> GlobalReserve, single: total=17.16MiB, used=0.00B
> ===
>
> Issueing the umount:
> ----
> # umount /mnt/btrfs-test && modprobe -rv zram
OK, I couldn't reproduce it in my environment (5.4-rc3+ based
btrfs-devel/misc-next form David) with this script. I'll dig deeper.
> Killed
> ===
>
> And last but not least: the output in dmesg:
> ---
> [ 205.960233] BTRFS info (device zram0): 2 enospc errors during balance
> [ 205.960235] BTRFS info (device zram0): balance: ended with status: -28
Here balance ended with -ENOSPC (28).
> [ 235.774821] BUG: kernel NULL pointer dereference, address: 00000000000002ce
That's a NULL pointer deference with an offset of 0x2ce (718).
> [ 235.774826] #PF: supervisor read access in kernel mode
> [ 235.774828] #PF: error_code(0x0000) - not-present page
> [ 235.774830] PGD 0 P4D 0
> [ 235.774834] Oops: 0000 [#1] SMP PTI
> [ 235.774838] CPU: 3 PID: 5421 Comm: umount Not tainted
> 5.3.6-300.fc31.x86_64 #1
> [ 235.774840] Hardware name: LENOVO 80JV/Lenovo U41-70, BIOS BDCN71WW
> 08/03/2016
> [ 235.774847] RIP: 0010:__free_pages+0x5/0x30
> [ 235.774850] Code: 00 48 89 c3 fa 66 0f 1f 44 00 00 48 89 ef 4c 89
> e6 e8 2f ef ff ff 48 89 df 57 9d 0f 1f 44 00 00 5b 5d 41 5c c3 0f 1f
> 44 00 00 <8b> 47 34 85 c0 74 12 f0 ff 4f 34 75 06 85 f6 75 03 eb 88 c3
> e9 82
> [ 235.774852] RSP: 0018:ffffc3cf0ffb7db0 EFLAGS: 00010046
> [ 235.774854] RAX: ffffa0ad0ffa0118 RBX: 0000000000000045 RCX: 0000000000000000
> [ 235.774856] RDX: ffffa0aeceaee2f0 RSI: 0000000000000000 RDI: 000000000000029a
> [ 235.774858] RBP: ffffa0ad0ffa0000 R08: fffffb06c23fd108 R09: fffffb06c23fd108
> [ 235.774860] R10: 0000000000068879 R11: fffffb06c02cf820 R12: 0000000000000045
> [ 235.774861] R13: ffffa0ade5c00010 R14: ffffa0ae0d19e5ac R15: ffffa0ae0d19e578
> [ 235.774864] FS: 00007f3b40f6cc80(0000) GS:ffffa0aeceac0000(0000)
> knlGS:0000000000000000
> [ 235.774866] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 235.774867] CR2: 00000000000002ce CR3: 000000004c1f0004 CR4: 00000000003606e0
CR2 has it as well, so we're page faulting on an access to 0x2ce.
Let's look at __free_pages():
(gdb) l *(__free_pages+0x5)
0xffffffff81179e45 is in __free_pages (mm/page_alloc.c:4818).
4813 __free_pages_ok(page, order);
4814 }
4815
4816 void __free_pages(struct page *page, unsigned int order)
4817 {
4818 if (put_page_testzero(page))
4819 free_the_page(page, order);
4820 }
4821 EXPORT_SYMBOL(__free_pages);
4822
(gdb)
We're getting called by __free_page() so we know order is 0. So
something must have passed in a NULL page (somehow).
Looking at __free_raid_bio() one step further up the call-chain I see this:
for (i = 0; i < rbio->nr_pages; i++) {
if (rbio->stripe_pages[i]) {
__free_page(rbio->stripe_pages[i]);
rbio->stripe_pages[i] = NULL;
}
}
> [ 235.774869] Call Trace:
> [ 235.774920] __free_raid_bio+0x72/0xb0 [btrfs]
> [ 235.774961] btrfs_free_stripe_hash_table+0x3d/0x70 [btrfs]
> [ 235.774992] close_ctree+0x1ea/0x2f0 [btrfs]
> [ 235.774998] generic_shutdown_super+0x6c/0x100
> [ 235.775001] kill_anon_super+0x14/0x30
> [ 235.775024] btrfs_kill_super+0x12/0xa0 [btrfs]
> [ 235.775029] deactivate_locked_super+0x36/0x70
> [ 235.775033] cleanup_mnt+0x104/0x150
> [ 235.775038] task_work_run+0x87/0xa0
> [ 235.775043] exit_to_usermode_loop+0xda/0x100
> [ 235.775047] do_syscall_64+0x183/0x1a0
> [ 235.775053] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 235.775056] RIP: 0033:0x7f3b411b767b
> [ 235.775060] Code: 08 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 90 f3
> 0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00
> 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d dd 07 0c 00 f7 d8 64 89
> 01 48
This decodes to:
All code
========
0: 08 0c 00 or %cl,(%rax,%rax,1)
3: f7 d8 neg %eax
5: 64 89 01 mov %eax,%fs:(%rcx)
8: 48 83 c8 ff or $0xffffffffffffffff,%rax
c: c3 retq
d: 66 90 xchg %ax,%ax
f: f3 0f 1e fa endbr64
13: 31 f6 xor %esi,%esi
15: e9 05 00 00 00 jmpq 0x1f
1a: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
1f: f3 0f 1e fa endbr64
23: b8 a6 00 00 00 mov $0xa6,%eax
28: 0f 05 syscall
2a:* 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax <--
trapping instruction
30: 73 01 jae 0x33
32: c3 retq
33: 48 8b 0d dd 07 0c 00 mov 0xc07dd(%rip),%rcx # 0xc0817
3a: f7 d8 neg %eax
3c: 64 89 01 mov %eax,%fs:(%rcx)
3f: 48 rex.W
Code starting with the faulting instruction
===========================================
0: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax
6: 73 01 jae 0x9
8: c3 retq
9: 48 8b 0d dd 07 0c 00 mov 0xc07dd(%rip),%rcx # 0xc07ed
10: f7 d8 neg %eax
12: 64 89 01 mov %eax,%fs:(%rcx)
15: 48 rex.W
This doesn't look like __free_pages
(gdb) disassemble __free_pages
Dump of assembler code for function __free_pages:
0xffffffff81179e40 <+0>: lock decl 0x34(%rdi)
0xffffffff81179e44 <+4>: jne 0xffffffff81179e51 <__free_pages+17>
0xffffffff81179e46 <+6>: test %esi,%esi
0xffffffff81179e48 <+8>: je 0xffffffff81179e4f <__free_pages+15>
0xffffffff81179e4a <+10>: jmpq 0xffffffff81178500 <__free_pages_ok>
0xffffffff81179e4f <+15>: jmp 0xffffffff81179de0 <free_unref_page>
0xffffffff81179e51 <+17>: repz retq
End of assembler dump.
(gdb)
--
Johannes Thumshirn SUSE Labs Filesystems
jthumshirn@xxxxxxx +49 911 74053 689
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Felix Imendörffer
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850