Re: Writing a large file causes odd freeze

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



No qgroups, just a big FS to dump backups.
I'll try it without free space cache too.

I appreciate all the suggestions!
On Tue, Oct 2, 2018 at 7:22 PM Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote:
>
>
>
> On 2018/10/2 下午9:31, Gerard Saraber wrote:
> > I have a 25TB mirrored filesystem that I'm able to consistently freeze
> > by ungzipping a large file.
> > The filesystem scrubs complete without errors and smartd reports no
> > errors at the moment.
> >
> > The command is:
> > gzip -dc /btrfsarray/largefile.gz > /btrfs/db/output.sql
> > At around 31GB it quits writing to the output file, and I start
> > getting the 'hung task' messages in my kernel log.
> > They are 3.0GB gzipped SQL dumps, that uncompress into about 42GB SQL
> > dump files.
>
> I assume you're not using qgroups, right?
>
> Just to rule out some obvious transaction slowdown causes.
>
> BTW, as you're already using space_cache=v2, what about disable free
> space cache completely?
> If it's really caused by space cache writing at transaction committing
> time, disabling it more or release reduce the hang time.
>
> Thanks,
> Qu
>
> >
> > after that 'ps auwx' completely hangs and I can't write to the
> > filesystem anymore, no errors, the processes just hang. I can still
> > read from the FS as far as I can tell.
> > I'm not sure how to diagnose it, or if i've simply ran into some kind
> > of bug (may not even be BTRFS related)
> >
> > Linux cloud1 4.18.10 #1 SMP Fri Sep 28 07:44:06 CDT 2018 x86_64 x86_64
> > x86_64 GNU/Linux
> > [ I think It used to work on 4.18.0-rcX ]
> >
> > Hardware specs:
> >  AMD Ryzen Threadripper 1950X 16-Core Processor
> > 128GB ram
> > A bunch of 3-8TB SATA drives connected to the motherboard and a pair
> > of LSI cards
> >
> > root@cloud1:~# btrfs fi usage /btrfsarray
> > Overall:
> >     Device size:                  79.14TiB
> >     Device allocated:             55.38TiB
> >     Device unallocated:           23.76TiB
> >     Device missing:                  0.00B
> >     Used:                         55.30TiB
> >     Free (estimated):             11.91TiB      (min: 11.91TiB)
> >     Data ratio:                       2.00
> >     Metadata ratio:                   2.00
> >     Global reserve:              512.00MiB      (used: 0.00B)
> >
> > Data,RAID1: Size:27.61TiB, Used:27.58TiB
> >    /dev/sda        1.64TiB
> >    /dev/sdb        1.64TiB
> >    /dev/sdc        1.87TiB
> >    /dev/sdd        4.37TiB
> >    /dev/sde        2.55TiB
> >    /dev/sdf        2.55TiB
> >    /dev/sdg        4.37TiB
> >    /dev/sdh        4.36TiB
> >    /dev/sdi        2.55TiB
> >    /dev/sdj        4.36TiB
> >    /dev/sdk        4.37TiB
> >    /dev/sdl        1.64TiB
> >    /dev/sdm        4.37TiB
> >    /dev/sdn        2.54TiB
> >    /dev/sdo        2.55TiB
> >    /dev/sdp        2.55TiB
> >    /dev/sdq        2.55TiB
> >    /dev/sdr        4.37TiB
> >
> > Metadata,RAID1: Size:87.00GiB, Used:77.46GiB
> >    /dev/sda        5.00GiB
> >    /dev/sdb        5.00GiB
> >    /dev/sdd       14.00GiB
> >    /dev/sde       12.00GiB
> >    /dev/sdf        8.00GiB
> >    /dev/sdg        6.00GiB
> >    /dev/sdh       20.00GiB
> >    /dev/sdi       11.00GiB
> >    /dev/sdj       19.00GiB
> >    /dev/sdk        5.00GiB
> >    /dev/sdl        6.00GiB
> >    /dev/sdm        7.00GiB
> >    /dev/sdn       17.00GiB
> >    /dev/sdo        7.00GiB
> >    /dev/sdp       13.00GiB
> >    /dev/sdq       12.00GiB
> >    /dev/sdr        7.00GiB
> >
> > System,RAID1: Size:32.00MiB, Used:5.19MiB
> >    /dev/sdc       32.00MiB
> >    /dev/sdr       32.00MiB
> >
> > Unallocated:
> >    /dev/sda        1.08TiB
> >    /dev/sdb        1.08TiB
> >    /dev/sdc        5.40TiB
> >    /dev/sdd        1.08TiB
> >    /dev/sde        1.08TiB
> >    /dev/sdf        1.08TiB
> >    /dev/sdg        1.08TiB
> >    /dev/sdh        1.08TiB
> >    /dev/sdi        1.08TiB
> >    /dev/sdj        1.08TiB
> >    /dev/sdk        1.08TiB
> >    /dev/sdl        1.08TiB
> >    /dev/sdm        1.08TiB
> >    /dev/sdn        1.08TiB
> >    /dev/sdo        1.08TiB
> >    /dev/sdp        1.08TiB
> >    /dev/sdq        1.08TiB
> >    /dev/sdr        1.08TiB
> >
> >
> > Oct  2 08:02:25 cloud1 kernel: [165537.793055] INFO: task smbd:47578
> > blocked for more than 120 seconds.
> > Oct  2 08:02:25 cloud1 kernel: [165537.793068]       Not tainted 4.18.10 #1
> > Oct  2 08:02:25 cloud1 kernel: [165537.793079] "echo 0 >
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Oct  2 08:02:25 cloud1 kernel: [165537.793094] smbd            D    0
> > 47578   3375 0x00000000
> > Oct  2 08:02:25 cloud1 kernel: [165537.793096] Call Trace:
> > Oct  2 08:02:25 cloud1 kernel: [165537.793098]  ? __schedule+0x299/0x880
> > Oct  2 08:02:25 cloud1 kernel: [165537.793100]  schedule+0x28/0x80
> > Oct  2 08:02:25 cloud1 kernel: [165537.793102]  wait_current_trans+0xad/0xe0
> > Oct  2 08:02:25 cloud1 kernel: [165537.793104]  ? wait_woken+0x80/0x80
> > Oct  2 08:02:25 cloud1 kernel: [165537.793106]  start_transaction+0x2ee/0x3c0
> > Oct  2 08:02:25 cloud1 kernel: [165537.793108]  btrfs_sync_file+0x279/0x400
> > Oct  2 08:02:25 cloud1 kernel: [165537.793110]
> > btrfs_file_write_iter+0x461/0x576
> > Oct  2 08:02:25 cloud1 kernel: [165537.793111]  __vfs_write+0x114/0x1a0
> > Oct  2 08:02:25 cloud1 kernel: [165537.793113]  vfs_write+0xad/0x1a0
> > Oct  2 08:02:25 cloud1 kernel: [165537.793115]  ksys_pwrite64+0x71/0x90
> > Oct  2 08:02:25 cloud1 kernel: [165537.793116]  ? __switch_to_asm+0x34/0x70
> > Oct  2 08:02:25 cloud1 kernel: [165537.793117]  do_syscall_64+0x4f/0x100
> > Oct  2 08:02:25 cloud1 kernel: [165537.793119]
> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > Oct  2 08:02:25 cloud1 kernel: [165537.793120] RIP: 0033:0x7f82af38104f
> > Oct  2 08:02:25 cloud1 kernel: [165537.793120] Code: Bad RIP value.
> > Oct  2 08:02:25 cloud1 kernel: [165537.793122] RSP:
> > 002b:00007f8296b4fd70 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
> > Oct  2 08:02:25 cloud1 kernel: [165537.793124] RAX: ffffffffffffffda
> > RBX: 0000000000000009 RCX: 00007f82af38104f
> > Oct  2 08:02:25 cloud1 kernel: [165537.793124] RDX: 0000000000001000
> > RSI: 000055e06183fea0 RDI: 0000000000000009
> > Oct  2 08:02:25 cloud1 kernel: [165537.793125] RBP: 000055e06183fea0
> > R08: 0000000000000000 R09: 00007fffac636080
> > Oct  2 08:02:25 cloud1 kernel: [165537.793126] R10: 00000000089c8000
> > R11: 0000000000000293 R12: 0000000000001000
> > Oct  2 08:02:25 cloud1 kernel: [165537.793127] R13: 00000000089c8000
> > R14: 00007f82aa40e090 R15: 000055e0616b5f70
> >
> >
> > Oct  2 08:04:25 cloud1 kernel: [165658.622345] INFO: task
> > btrfs-transacti:12370 blocked for more than 120 seconds.
> > Oct  2 08:04:25 cloud1 kernel: [165658.622366]       Not tainted 4.18.10 #1
> > Oct  2 08:04:25 cloud1 kernel: [165658.622378] "echo 0 >
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Oct  2 08:04:25 cloud1 kernel: [165658.622393] btrfs-transacti D    0
> > 12370      2 0x80000000
> > Oct  2 08:04:25 cloud1 kernel: [165658.622396] Call Trace:
> > Oct  2 08:04:25 cloud1 kernel: [165658.622406]  ? __schedule+0x299/0x880
> > Oct  2 08:04:25 cloud1 kernel: [165658.622408]  ?
> > _raw_spin_unlock_irqrestore+0xa/0x10
> > Oct  2 08:04:25 cloud1 kernel: [165658.622410]  schedule+0x28/0x80
> > Oct  2 08:04:25 cloud1 kernel: [165658.622414]
> > btrfs_commit_transaction+0x760/0x870
> > Oct  2 08:04:25 cloud1 kernel: [165658.622418]  ? wait_woken+0x80/0x80
> > Oct  2 08:04:25 cloud1 kernel: [165658.622420]  transaction_kthread+0x12e/0x150
> > Oct  2 08:04:25 cloud1 kernel: [165658.622422]  ?
> > btrfs_cleanup_transaction+0x520/0x520
> > Oct  2 08:04:25 cloud1 kernel: [165658.622425]  kthread+0x113/0x130
> > Oct  2 08:04:25 cloud1 kernel: [165658.622427]  ?
> > kthread_create_worker_on_cpu+0x70/0x70
> > Oct  2 08:04:25 cloud1 kernel: [165658.622428]  ret_from_fork+0x22/0x40
> > Oct  2 08:04:25 cloud1 kernel: [165658.622487] INFO: task jsvc:15530
> > blocked for more than 120 seconds.
> > Oct  2 08:04:25 cloud1 kernel: [165658.622501]       Not tainted 4.18.10 #1
> > Oct  2 08:04:25 cloud1 kernel: [165658.622512] "echo 0 >
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Oct  2 08:04:25 cloud1 kernel: [165658.622527] jsvc            D    0
> > 15530  15312 0x00000120
> > Oct  2 08:04:25 cloud1 kernel: [165658.622529] Call Trace:
> > Oct  2 08:04:25 cloud1 kernel: [165658.622531]  ? __schedule+0x299/0x880
> > Oct  2 08:04:25 cloud1 kernel: [165658.622533]  schedule+0x28/0x80
> > Oct  2 08:04:25 cloud1 kernel: [165658.622536]
> > btrfs_start_ordered_extent+0xea/0x120
> > Oct  2 08:04:25 cloud1 kernel: [165658.622538]  ? wait_woken+0x80/0x80
> > Oct  2 08:04:25 cloud1 kernel: [165658.622540]  btrfs_page_mkwrite+0x1f3/0x500
> > Oct  2 08:04:25 cloud1 kernel: [165658.622544]  do_page_mkwrite+0x31/0x90
> > Oct  2 08:04:25 cloud1 kernel: [165658.622546]  do_wp_page+0x214/0x5a0
> > Oct  2 08:04:25 cloud1 kernel: [165658.622548]  __handle_mm_fault+0xb6a/0x1260
> > Oct  2 08:04:25 cloud1 kernel: [165658.622551]  ? __seccomp_filter+0x44/0x4c0
> > Oct  2 08:04:25 cloud1 kernel: [165658.622553]  handle_mm_fault+0xc6/0x200
> > Oct  2 08:04:25 cloud1 kernel: [165658.622556]  __do_page_fault+0x24c/0x4d0
> > Oct  2 08:04:25 cloud1 kernel: [165658.622558]  ? page_fault+0x8/0x30
> > Oct  2 08:04:25 cloud1 kernel: [165658.622559]  page_fault+0x1e/0x30
> > Oct  2 08:04:25 cloud1 kernel: [165658.622562] RIP: 0033:0x7f9b5aadb47e
> > Oct  2 08:04:25 cloud1 kernel: [165658.622563] Code: Bad RIP value.
> > Oct  2 08:04:25 cloud1 kernel: [165658.622569] RSP:
> > 002b:00007f92dccced60 EFLAGS: 00010246
> > Oct  2 08:04:25 cloud1 kernel: [165658.622571] RAX: 00009606c50d609c
> > RBX: 00007f9b5c5a1c30 RCX: 0000000000000018
> > Oct  2 08:04:25 cloud1 kernel: [165658.622572] RDX: 0000000000000000
> > RSI: 00007f92dccced30 RDI: 0000000000000001
> > Oct  2 08:04:25 cloud1 kernel: [165658.622572] RBP: 00007f92dccced70
> > R08: 001f9c207abaa2b8 R09: 00007fff9f29c080
> > Oct  2 08:04:25 cloud1 kernel: [165658.622573] R10: 00000000048732ba
> > R11: 0000000000000001 R12: 0000000000000028
> > Oct  2 08:04:25 cloud1 kernel: [165658.622574] R13: 0000561d92af6950
> > R14: 0000000000000002 R15: 0000000000000001
> > Oct  2 08:04:25 cloud1 kernel: [165658.622613] INFO: task mongod:23344
> > blocked for more than 120 seconds.
> > Oct  2 08:04:25 cloud1 kernel: [165658.622628]       Not tainted 4.18.10 #1
> > Oct  2 08:04:25 cloud1 kernel: [165658.622638] "echo 0 >
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Oct  2 08:04:25 cloud1 kernel: [165658.622653] mongod          D    0
> > 23344  15560 0x00000120
> > Oct  2 08:04:25 cloud1 kernel: [165658.622655] Call Trace:
> > Oct  2 08:04:25 cloud1 kernel: [165658.622657]  ? __schedule+0x299/0x880
> > Oct  2 08:04:25 cloud1 kernel: [165658.622659]  schedule+0x28/0x80
> > Oct  2 08:04:25 cloud1 kernel: [165658.622661]
> > btrfs_start_ordered_extent+0xea/0x120
> > Oct  2 08:04:25 cloud1 kernel: [165658.622663]  ? wait_woken+0x80/0x80
> > Oct  2 08:04:25 cloud1 kernel: [165658.622665]
> > btrfs_wait_ordered_range+0xbb/0x100
> > Oct  2 08:04:25 cloud1 kernel: [165658.622667]  btrfs_sync_file+0x30c/0x400
> > Oct  2 08:04:25 cloud1 kernel: [165658.622671]  do_fsync+0x38/0x60
> > Oct  2 08:04:25 cloud1 kernel: [165658.622673]  __x64_sys_fdatasync+0x13/0x20
> > Oct  2 08:04:25 cloud1 kernel: [165658.622675]  do_syscall_64+0x4f/0x100
> > Oct  2 08:04:25 cloud1 kernel: [165658.622677]
> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > Oct  2 08:04:25 cloud1 kernel: [165658.622678] RIP: 0033:0x7f3d36c122e7
> > Oct  2 08:04:25 cloud1 kernel: [165658.622679] Code: Bad RIP value.
> > Oct  2 08:04:25 cloud1 kernel: [165658.622681] RSP:
> > 002b:00007f3d31aef560 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
> > Oct  2 08:04:25 cloud1 kernel: [165658.622683] RAX: ffffffffffffffda
> > RBX: 000000000000000c RCX: 00007f3d36c122e7
> > Oct  2 08:04:25 cloud1 kernel: [165658.622683] RDX: 0000000000000000
> > RSI: 000000000000000c RDI: 000000000000000c
> > Oct  2 08:04:25 cloud1 kernel: [165658.622684] RBP: 00007f3d31aef5a0
> > R08: 00005633b3ab9c00 R09: 00007ffeef7a3080
> > Oct  2 08:04:25 cloud1 kernel: [165658.622685] R10: 0000000004873546
> > R11: 0000000000000293 R12: 00005633b3ab9cc0
> > Oct  2 08:04:25 cloud1 kernel: [165658.622686] R13: 00005633b1075d90
> > R14: 0000000000000000 R15: 00005633b39a20f8
> > Oct  2 08:04:25 cloud1 kernel: [165658.622688] INFO: task ftdc:23347
> > blocked for more than 120 seconds.
> > Oct  2 08:04:25 cloud1 kernel: [165658.622702]       Not tainted 4.18.10 #1
> > Oct  2 08:04:25 cloud1 kernel: [165658.622713] "echo 0 >
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Oct  2 08:04:25 cloud1 kernel: [165658.622728] ftdc            D    0
> > 23347  15560 0x00000120
> > Oct  2 08:04:25 cloud1 kernel: [165658.622729] Call Trace:
> > Oct  2 08:04:25 cloud1 kernel: [165658.622731]  ? __schedule+0x299/0x880
> > Oct  2 08:04:25 cloud1 kernel: [165658.622733]  schedule+0x28/0x80
> > Oct  2 08:04:25 cloud1 kernel: [165658.622735]  wait_current_trans+0xad/0xe0
> > Oct  2 08:04:25 cloud1 kernel: [165658.622737]  ? wait_woken+0x80/0x80
> > Oct  2 08:04:25 cloud1 kernel: [165658.622738]  start_transaction+0x2ee/0x3c0
> > Oct  2 08:04:25 cloud1 kernel: [165658.622740]  btrfs_create+0x57/0x1f0
> > Oct  2 08:04:25 cloud1 kernel: [165658.622743]  path_openat+0x13c1/0x16a0
> > Oct  2 08:04:25 cloud1 kernel: [165658.622746]  do_filp_open+0x9b/0x110
> > Oct  2 08:04:25 cloud1 kernel: [165658.622748]  ? try_lookup_one_len+0x70/0x70
> > Oct  2 08:04:25 cloud1 kernel: [165658.622751]  ? close_pdeo+0x93/0xf0
> > Oct  2 08:04:25 cloud1 kernel: [165658.622753]  ? __check_object_size+0xa7/0x1a0
> > Oct  2 08:04:25 cloud1 kernel: [165658.622756]  ? __alloc_fd+0x3d/0x160
> > Oct  2 08:04:25 cloud1 kernel: [165658.622758]  ? do_sys_open+0x1bd/0x250
> > Oct  2 08:04:25 cloud1 kernel: [165658.622760]  do_sys_open+0x1bd/0x250
> > Oct  2 08:04:25 cloud1 kernel: [165658.622762]  do_syscall_64+0x4f/0x100
> > Oct  2 08:04:25 cloud1 kernel: [165658.622763]
> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > Oct  2 08:04:25 cloud1 kernel: [165658.622764] RIP: 0033:0x7f3d36c0ad19
> > Oct  2 08:04:25 cloud1 kernel: [165658.622765] Code: Bad RIP value.
> > Oct  2 08:04:25 cloud1 kernel: [165658.622767] RSP:
> > 002b:00007f3d302ec200 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
> > Oct  2 08:04:25 cloud1 kernel: [165658.622769] RAX: ffffffffffffffda
> > RBX: 00005633b7197d80 RCX: 00007f3d36c0ad19
> > Oct  2 08:04:25 cloud1 kernel: [165658.622769] RDX: 0000000000000241
> > RSI: 00005633b7106740 RDI: 00000000ffffff9c
> > Oct  2 08:04:25 cloud1 kernel: [165658.622770] RBP: 00005633b1077b13
> > R08: 0000000000000000 R09: 00005633b1077b14
> > Oct  2 08:04:25 cloud1 kernel: [165658.622771] R10: 00000000000001b6
> > R11: 0000000000000293 R12: 0000000000000004
> > Oct  2 08:04:25 cloud1 kernel: [165658.622771] R13: 00005633b1077b12
> > R14: 00007f3d302ec720 R15: 00005633b9942004
> >
>




[Index of Archives]     [Linux Filesystem Development]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux