Here is a fresh snapshot hang example:
Here kworker/u24:0 is blocked waiting on btrfs_tree_read_lock.
Again the underlying call is btrfs_search_slot, reached after going
through cache_block_group.
The other two tasks are blocked at the same place in cache_block_group,
so I can confirm it is hanging the same way as with umount.
Here is the struct extent_buffer passed to btrfs_tree_read_lock for
kworker/u24:0:
struct extent_buffer {
...
lock_owner = 23286,
write_locks = {
counter = 1
},
read_locks = {
counter = 0
},
blocking_writers = {
counter = 1
},
...
}
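For reference, here is roughly what the reader side does with those
counters. This is paraphrased from my reading of fs/btrfs/locking.c in
our 4.4 tree and simplified, so treat it as a sketch rather than a
verbatim copy:

void btrfs_tree_read_lock(struct extent_buffer *eb)
{
again:
        read_lock(&eb->lock);
        if (atomic_read(&eb->blocking_writers) &&
            current->pid == eb->lock_owner) {
                /* a nested read lock inside our own write lock is allowed */
                eb->lock_nested = 1;
                read_unlock(&eb->lock);
                return;
        }
        if (atomic_read(&eb->blocking_writers)) {
                /* somebody else holds a blocking write lock: drop the rwlock
                 * and sleep until blocking_writers goes back to zero */
                read_unlock(&eb->lock);
                wait_event(eb->read_lock_wq,
                           atomic_read(&eb->blocking_writers) == 0);
                goto again;
        }
        /* no blocking writer: take it as a spinning read lock */
        atomic_inc(&eb->read_locks);
        atomic_inc(&eb->spinning_readers);
}

With blocking_writers = 1 and lock_owner = 23286 (not our pid),
kworker/u24:0 ends up in the second branch, sleeping in wait_event()
until pid 23286 releases its write lock, which it never does.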
So again the ioctl has taken the lock needed for its underlying operation.
The btrfs task, pid 23286, is going to perform a write, so it can't skip
locking, and it needs not just read locking but write locking.
The fact that it owns the write lock is fine: skip_locking in its
btrfs_path struct is zero, so up to this point everything is properly
locked. The problem arises later, when the cache isn't available and we
end up calling cache_block_group. Even if it does not hang on the wait
at the beginning of cache_block_group (which is where it hangs in the
current case):
while (cache->cached == BTRFS_CACHE_FAST) {
        struct btrfs_caching_control *ctl;

        ctl = cache->caching_ctl;
        atomic_inc(&ctl->count);
        prepare_to_wait(&ctl->wait, &wait, TASK_UNINTERRUPTIBLE);
        spin_unlock(&cache->lock);

>       schedule();

        finish_wait(&ctl->wait, &wait);
        put_caching_control(ctl);
        spin_lock(&cache->lock);
}
it seems there is a high chance of hanging later in
btrfs_tree_read_lock, even with absolutely no concurrent work going on,
via this call chain:
cache_block_group
->load_free_space_cache
->lookup_free_space_inode
->__lookup_free_space_inode
->btrfs_search_slot
->btrfs_read_lock_root_node
->btrfs_tree_read_lock
since we are already holding the write lock.
It looks like calling cache_block_group from find_free_extent, while
underneath a btrfs_search_slot call, will deadlock as well whenever the
block group data isn't cached.
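To spell out the cycle as I read it from this dump (which tree and which
extent buffer are involved is my interpretation, I haven't verified it
field by field):

btrfs (pid 23286, snap destroy ioctl):
  btrfs_search_slot
    -> holds the write lock on the eb dumped above (lock_owner = 23286)
    ->btrfs_cow_block
      ->btrfs_alloc_tree_block
        ->find_free_extent
          ->cache_block_group
            -> sleeps in the BTRFS_CACHE_FAST wait loop quoted above

kworker/u24:0 (the task actually loading the free space cache, so
presumably the one that would issue the wakeup 23286 and 23285 are
sleeping for):
  cache_block_group
    ->load_free_space_cache
      ->lookup_free_space_inode
        ->btrfs_search_slot
          ->btrfs_read_lock_root_node
            ->btrfs_tree_read_lock
              -> blocked, since blocking_writers = 1, lock_owner = 23286

Each of the two is waiting for the other, so neither can make progress.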
It looks scary. I hope that I've missed something.
Best regards,
Angel
On Thu, Aug 10, 2017 at 1:30 AM, Angel Shtilianov <kernel@xxxxxxxx> wrote:
> Hi Jeff,
> That's exactly what I am observing in the backtraces.
> umount reaches cache_block_group by passing through btrfs_search_slot,
> which looks like it has taken the write lock (to be confirmed tomorrow
> when I'm in the office), and all the other btrfs worker threads are
> fighting for the btrfs_tree_read_lock.
> Umount is hanging on the cache_block_group wait, all others are
> hanging on the read_lock. So caching seems to be the bottleneck,
> blocking all others. Good catch! I'll take a look at the commit
> tomorrow morning (with a large cup of coffee, of course).
>
> By the way, looking through the logs I can see one more pattern -
> umount after a snapshot drop reaches a similar state. The common element
> is that drop snapshot will also call search slot, take the read lock,
> drop it and take the write lock. At the same time there is an umount
> causing an fs_sync to run concurrently. I've seen at least three such
> hangs today and I've started copying the corresponding crashdumps to my
> machine. Tomorrow I'll take a look at them. The common point seems to be
> that umount again reaches the same place - cache_block_group - and
> hangs. Tomorrow I'll also look at the workers' state in the crash dump.
>
> Thank you,
> Angel
>
> On Wed, Aug 9, 2017 at 6:42 PM, Jeff Mahoney <jeffm@xxxxxxxx> wrote:
>> On 8/8/17 7:30 AM, Angel Shtilianov wrote:
>>> crash> bt -f 31625
>>> PID: 31625 TASK: ffff88046a833400 CPU: 7 COMMAND: "btrfs-transacti"
>>> wants to acquire struct extent_buffer ffff88000460aca0 lock, whose
>>> lock_owner is 27574.
>>>
>>> here is pid 27574:
>>> PID: 27574 TASK: ffff88038b469a00 CPU: 4 COMMAND: "kworker/u32:9"
>>> which is also trying to acquire eb lock ffff8802598b6200, and here
>>> the owner is 31696.
>>>
>>> 31696 is
>>> PID: 31696 TASK: ffff88044b59ce00 CPU: 5 COMMAND: "umount"
>>>
>>> So there is definitely a deadlock of some kind here.
>>> umount holds the lock the workers need in order to complete, and waits
>>> for them to complete.
>>> Lockdep wouldn't complain about that.
>>> I still have to investigate what previously triggered/disabled lockdep.
>>> I have to obtain the log from the machine, but I need some time to get it.
>>>
>>> Jeff, you were right.
>>> Could you help demystify how we ended up here?
>>
>> Hi Angel -
>>
>> It looks like a regression introduced by 291c7d2f5, but that's a very
>> old commit. As that commit says, it's a rare occurrence to hit that
>> wait, and that's probably why we haven't seen this issue sooner.
>>
>> There's potential for this to happen whenever two threads are modifying
>> the tree at once and one needs to find a free extent. I'll need to
>> think a bit on how to fix it.
>>
>> -Jeff
>>
>>> Best regards,
>>> Angel
>>>
>>> On Mon, Aug 7, 2017 at 9:10 PM, Jeff Mahoney <jeffm@xxxxxxxx> wrote:
>>>> On 8/7/17 1:19 PM, Jeff Mahoney wrote:
>>>>> On 8/7/17 10:12 AM, Angel Shtilianov wrote:
>>>>>> Hi there,
>>>>>> I'm investigating sporadic hanging during btrfs umount. The FS is
>>>>>> contained in a loop mounted file.
>>>>>> I have no reproduction scenario and the issue may happen once a day or
>>>>>> once a month. It is rare, but frustrating.
>>>>>> I have a crashdump (the server was manually crashed to collect one),
>>>>>> so I can take a look through the data structures.
>>>>>> What happens is that umount gets stuck in D state and the kernel
>>>>>> complains about hung tasks. We are using kernel 4.4.y; the actual
>>>>>> backtrace is from 4.4.70, but this happens with all the 4.4 kernels
>>>>>> I've used (4.4.30 through 4.4.70).
>>>>>> Tasks like:
>>>>>> INFO: task kworker/u32:9:27574 blocked for more than 120 seconds.
>>>>>> INFO: task kworker/u32:12:27575 blocked for more than 120 seconds.
>>>>>> INFO: task btrfs-transacti:31625 blocked for more than 120 seconds.
>>>>>> are blocked waiting for btrfs_tree_read_lock, which is owned by
>>>>>> task umount:31696 (itself also blocked for more than 120 seconds),
>>>>>> according to the lock debugging output.
>>>>>>
>>>>>> umount is hung in "cache_block_group", see the '>' mark:
>>>>>> while (cache->cached == BTRFS_CACHE_FAST) {
>>>>>> struct btrfs_caching_control *ctl;
>>>>>>
>>>>>> ctl = cache->caching_ctl;
>>>>>> atomic_inc(&ctl->count);
>>>>>> prepare_to_wait(&ctl->wait, &wait, TASK_UNINTERRUPTIBLE);
>>>>>> spin_unlock(&cache->lock);
>>>>>>
>>>>>>> schedule();
>>>>>>
>>>>>> finish_wait(&ctl->wait, &wait);
>>>>>> put_caching_control(ctl);
>>>>>> spin_lock(&cache->lock);
>>>>>> }
>>>>>>
>>>>>> The complete backtraces could be found in the attached log.
>>>>>>
>>>>>> Do you have any ideas ?
>>>>>
>>>>> Hi Angel -
>>>>>
>>>>> In your log, it says lockdep is disabled. What tripped it earlier?
>>>>> Lockdep really should be catching locking deadlocks in situations like
>>>>> this, if that's really the underlying cause.
>>>>
>>>> Actually, I'm not sure if lockdep would catch this one. Here's my
>>>> hypothesis:
>>>>
>>>> kworker/u32:9 is waiting on a read lock while reading the free space
>>>> cache, which means it owns the cache->cached value and will issue the
>>>> wakeup when it completes.
>>>>
>>>> umount is waiting for the wakeup from kworker/u32:9 but is holding
>>>> some tree locks in write mode.
>>>>
>>>> If kworker/u32:9 is waiting on the locks that umount holds, we have a
>>>> deadlock.
>>>>
>>>> Can you dump the extent buffer that kworker/u32:9 is waiting on? Part
>>>> of that will contain the PID of the holder, and if it matches umount,
>>>> we've found the cause.
>>>>
>>>> -Jeff
>>>>
>>>> --
>>>> Jeff Mahoney
>>>> SUSE Labs
>>>>
>>
>>
>> --
>> Jeff Mahoney
>> SUSE Labs
>>
[2003789.023934] sysrq: SysRq : Show Blocked State
[2003789.024159] task PC stack pid father
[2003789.024613] kworker/u24:0 D ffff88037d363470 0 17905 2 0x00000000
[2003789.025027] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
[2003789.025378] ffff88037d363470 ffff88026bbd1850 ffff88037d363438 ffffffff00000000
[2003789.025857] ffff88046bf78000 ffff88000543b400 ffff88037d364000 ffff88026bbd1850
[2003789.026341] ffff88000543b400 ffff88000543b400 ffff88037d3634b0 ffff88037d363488
[2003789.026817] Call Trace:
[2003789.026992] [<ffffffff816b41ac>] schedule+0x3c/0x90
[2003789.027190] [<ffffffffa0ac30b5>] btrfs_tree_read_lock+0xe5/0x140 [btrfs]
[2003789.027399] [<ffffffff810a3390>] ? wait_woken+0xb0/0xb0
[2003789.027594] [<ffffffffa0a59584>] btrfs_read_lock_root_node+0x34/0x50 [btrfs]
[2003789.030864] [<ffffffffa0a5e4db>] btrfs_search_slot+0x70b/0x9d0 [btrfs]
[2003789.031046] [<ffffffff810812d2>] ? __might_sleep+0x52/0xb0
[2003789.031244] [<ffffffffa0a7a3bf>] btrfs_lookup_inode+0x2f/0xa0 [btrfs]
[2003789.031439] [<ffffffffa0a925e7>] btrfs_iget+0xd7/0x6a0 [btrfs]
[2003789.031629] [<ffffffffa0acb1df>] __lookup_free_space_inode+0xdf/0x150 [btrfs]
[2003789.031925] [<ffffffffa0accf0b>] lookup_free_space_inode+0x6b/0xe0 [btrfs]
[2003789.032109] [<ffffffffa0ace4ca>] load_free_space_cache+0x7a/0x1b0 [btrfs]
[2003789.032300] [<ffffffffa0a637b7>] cache_block_group+0x1b7/0x3d0 [btrfs]
[2003789.032474] [<ffffffff810a3390>] ? wait_woken+0xb0/0xb0
[2003789.032656] [<ffffffffa0a72c9f>] find_free_extent+0x7bf/0xf40 [btrfs]
[2003789.032838] [<ffffffffa0a73487>] btrfs_reserve_extent+0x67/0x150 [btrfs]
[2003789.033020] [<ffffffffa0a73669>] btrfs_alloc_tree_block+0xf9/0x480 [btrfs]
[2003789.033210] [<ffffffffa0a59f0c>] __btrfs_cow_block+0x12c/0x5a0 [btrfs]
[2003789.033399] [<ffffffffa0a5a516>] btrfs_cow_block+0x116/0x2a0 [btrfs]
[2003789.033586] [<ffffffffa0a5dfa6>] btrfs_search_slot+0x1d6/0x9d0 [btrfs]
[2003789.033774] [<ffffffffa0a5ff71>] btrfs_insert_empty_items+0x71/0xc0 [btrfs]
[2003789.034075] [<ffffffff811b69d3>] ? kmem_cache_alloc+0x203/0x2f0
[2003789.034267] [<ffffffffa0a6001e>] btrfs_insert_item+0x5e/0xd0 [btrfs]
[2003789.034458] [<ffffffffa0a6fa91>] btrfs_create_pending_block_groups+0x111/0x210 [btrfs]
[2003789.034764] [<ffffffffa0a86814>] __btrfs_end_transaction+0x94/0x340 [btrfs]
[2003789.035070] [<ffffffffa0a86ad0>] btrfs_end_transaction+0x10/0x20 [btrfs]
[2003789.035267] [<ffffffffa0a70f20>] flush_space+0xf0/0x4b0 [btrfs]
[2003789.035460] [<ffffffffa0a7140a>] ? btrfs_async_reclaim_metadata_space+0x12a/0x1c0 [btrfs]
[2003789.035767] [<ffffffffa0a7140a>] ? btrfs_async_reclaim_metadata_space+0x12a/0x1c0 [btrfs]
[2003789.036072] [<ffffffffa0a71470>] ? btrfs_async_reclaim_metadata_space+0x190/0x1c0 [btrfs]
[2003789.036384] [<ffffffffa0a71402>] btrfs_async_reclaim_metadata_space+0x122/0x1c0 [btrfs]
[2003789.036680] [<ffffffff81073766>] process_one_work+0x1e6/0x6b0
[2003789.036855] [<ffffffff810736dd>] ? process_one_work+0x15d/0x6b0
[2003789.037028] [<ffffffff81073c99>] worker_thread+0x69/0x480
[2003789.037214] [<ffffffff81073c30>] ? process_one_work+0x6b0/0x6b0
[2003789.037396] [<ffffffff8107a37d>] kthread+0x11d/0x140
[2003789.037573] [<ffffffff8107a260>] ? kthread_create_on_node+0x220/0x220
[2003789.037751] [<ffffffff816ba8ff>] ret_from_fork+0x3f/0x70
[2003789.037925] [<ffffffff8107a260>] ? kthread_create_on_node+0x220/0x220
[2003789.038117] btrfs-transacti D ffff88039dcfb6a8 0 23285 2 0x00000000
[2003789.038513] ffff88039dcfb6a8 ffff88046a26d4b0 ffff88039dcfb670 0000000000000000
[2003789.038991] ffff88046bf59a00 ffff880003ac3400 ffff88039dcfc000 ffff880312a1c040
[2003789.039476] ffff880126a6ba00 ffff88046a26d4b0 ffff88046a26d400 ffff88039dcfb6c0
[2003789.039953] Call Trace:
[2003789.040124] [<ffffffff816b41ac>] schedule+0x3c/0x90
[2003789.040316] [<ffffffffa0a63736>] cache_block_group+0x136/0x3d0 [btrfs]
[2003789.040495] [<ffffffff810a3390>] ? wait_woken+0xb0/0xb0
[2003789.040681] [<ffffffffa0a72c9f>] find_free_extent+0x7bf/0xf40 [btrfs]
[2003789.040868] [<ffffffffa0a73487>] btrfs_reserve_extent+0x67/0x150 [btrfs]
[2003789.041055] [<ffffffffa0a73669>] btrfs_alloc_tree_block+0xf9/0x480 [btrfs]
[2003789.041252] [<ffffffffa0a59f0c>] __btrfs_cow_block+0x12c/0x5a0 [btrfs]
[2003789.041441] [<ffffffffa0a5a516>] btrfs_cow_block+0x116/0x2a0 [btrfs]
[2003789.041627] [<ffffffffa0a5dfa6>] btrfs_search_slot+0x1d6/0x9d0 [btrfs]
[2003789.041813] [<ffffffffa0a65175>] lookup_inline_extent_backref+0xc5/0x5b0 [btrfs]
[2003789.042107] [<ffffffff810812d2>] ? __might_sleep+0x52/0xb0
[2003789.042301] [<ffffffffa0a5916a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
[2003789.042492] [<ffffffffa0a67701>] __btrfs_free_extent.isra.67+0xf1/0xcb0 [btrfs]
[2003789.042800] [<ffffffffa0ad385d>] ? btrfs_merge_delayed_refs+0x8d/0x630 [btrfs]
[2003789.043102] [<ffffffffa0a6b982>] ? __btrfs_run_delayed_refs+0x192/0x1630 [btrfs]
[2003789.043416] [<ffffffffa0a6c0c0>] __btrfs_run_delayed_refs+0x8d0/0x1630 [btrfs]
[2003789.043715] [<ffffffff810d4ea4>] ? lock_timer_base.isra.35+0x54/0x70
[2003789.043906] [<ffffffffa0a842eb>] ? join_transaction.isra.13+0x2b/0x440 [btrfs]
[2003789.044218] [<ffffffffa0a6fc12>] btrfs_run_delayed_refs+0x82/0x2c0 [btrfs]
[2003789.044412] [<ffffffffa0a855f3>] btrfs_commit_transaction+0x43/0xb10 [btrfs]
[2003789.044719] [<ffffffffa0a809ac>] transaction_kthread+0x22c/0x290 [btrfs]
[2003789.044907] [<ffffffffa0a80780>] ? btrfs_cleanup_transaction+0x650/0x650 [btrfs]
[2003789.045210] [<ffffffff8107a37d>] kthread+0x11d/0x140
[2003789.045385] [<ffffffff8107a260>] ? kthread_create_on_node+0x220/0x220
[2003789.045560] [<ffffffff816ba8ff>] ret_from_fork+0x3f/0x70
[2003789.045735] [<ffffffff8107a260>] ? kthread_create_on_node+0x220/0x220
[2003789.045910] btrfs D ffff88026b22b880 0 23286 23202 0x00000000
[2003789.046291] ffff88026b22b880 ffff88046a26d4b0 ffff88026b22b848 0000000000000000
[2003789.046767] ffff88046bf5b400 ffff8800378e3400 ffff88026b22c000 ffff880312a1c040
[2003789.047241] ffff88046572b400 ffff88046a26d4b0 ffff88046a26d400 ffff88026b22b898
[2003789.047720] Call Trace:
[2003789.047890] [<ffffffff816b41ac>] schedule+0x3c/0x90
[2003789.048076] [<ffffffffa0a63736>] cache_block_group+0x136/0x3d0 [btrfs]
[2003789.048260] [<ffffffff810a3390>] ? wait_woken+0xb0/0xb0
[2003789.048449] [<ffffffffa0a72c9f>] find_free_extent+0x7bf/0xf40 [btrfs]
[2003789.048638] [<ffffffffa0a73487>] btrfs_reserve_extent+0x67/0x150 [btrfs]
[2003789.048827] [<ffffffffa0a73669>] btrfs_alloc_tree_block+0xf9/0x480 [btrfs]
[2003789.049015] [<ffffffffa0a59f0c>] __btrfs_cow_block+0x12c/0x5a0 [btrfs]
[2003789.049211] [<ffffffffa0a5a516>] btrfs_cow_block+0x116/0x2a0 [btrfs]
[2003789.049414] [<ffffffffa0a5dfa6>] btrfs_search_slot+0x1d6/0x9d0 [btrfs]
[2003789.049601] [<ffffffffa0a5ff71>] btrfs_insert_empty_items+0x71/0xc0 [btrfs]
[2003789.049909] [<ffffffffa0ac3758>] btrfs_insert_orphan_item+0x58/0x80 [btrfs]
[2003789.050227] [<ffffffffa0abae48>] btrfs_ioctl_snap_destroy+0x4b8/0x7e0 [btrfs]
[2003789.050542] [<ffffffffa0ac0e89>] btrfs_ioctl+0x899/0x27e0 [btrfs]
[2003789.050720] [<ffffffff816b9ba7>] ? _raw_spin_unlock+0x27/0x40
[2003789.050897] [<ffffffff8119068d>] ? handle_mm_fault+0xdbd/0x1b80
[2003789.051073] [<ffffffff8118f926>] ? handle_mm_fault+0x56/0x1b80
[2003789.051257] [<ffffffff811ea6ef>] do_vfs_ioctl+0x30f/0x580
[2003789.051434] [<ffffffff810aa26f>] ? up_read+0x1f/0x40
[2003789.051605] [<ffffffff811f6796>] ? __fget_light+0x66/0x90
[2003789.051775] [<ffffffff811ea9d9>] SyS_ioctl+0x79/0x90
[2003789.051948] [<ffffffff816ba55b>] entry_SYSCALL_64_fastpath+0x16/0x73
[2003848.780864] sysrq: SysRq : Trigger a crash
[2003848.781093] BUG: unable to handle kernel NULL pointer dereference at (null)