On Wed, Feb 13, 2019 at 02:27:04PM +0800, Qu Wenruo wrote:
> From: Josef Bacik <josef@xxxxxxxxxxxxxx>
>
> Qgroups will do the old roots lookup at delayed ref time, which could be
> while walking down the extent root while running a delayed ref. This
> should be fine, except we specifically lock eb's in the backref walking
> code irrespective of path->skip_locking, which deadlocks the system.
> Fix up the backref code to honor path->skip_locking; nobody will be
> modifying the commit_root when we're searching, so it's completely safe
> to do.
>
> This happens since fb235dc06fac ("btrfs: qgroup: Move half of the qgroup
> accounting time out of commit trans"): with quota enabled, the kernel may
> lock up.
>
> Below is one such trace, triggered by snapshot dropping combined with
> write operations in the source subvolume. The deadlock can be reliably
> reproduced:
>
> btrfs-cleaner D 0 4062 2 0x80000000
> Call Trace:
> schedule+0x32/0x90
> btrfs_tree_read_lock+0x93/0x130 [btrfs]
> find_parent_nodes+0x29b/0x1170 [btrfs]
> btrfs_find_all_roots_safe+0xa8/0x120 [btrfs]
> btrfs_find_all_roots+0x57/0x70 [btrfs]
> btrfs_qgroup_trace_extent_post+0x37/0x70 [btrfs]
> btrfs_qgroup_trace_leaf_items+0x10b/0x140 [btrfs]
> btrfs_qgroup_trace_subtree+0xc8/0xe0 [btrfs]
> do_walk_down+0x541/0x5e3 [btrfs]
> walk_down_tree+0xab/0xe7 [btrfs]
> btrfs_drop_snapshot+0x356/0x71a [btrfs]
> btrfs_clean_one_deleted_snapshot+0xb8/0xf0 [btrfs]
> cleaner_kthread+0x12b/0x160 [btrfs]
> kthread+0x112/0x130
> ret_from_fork+0x27/0x50
>
> When dropping snapshots with qgroup enabled, we will trigger backref
> walk.
>
> However such backref walk at that timing is pretty dangerous, as if one
> of the parent nodes get WRITE locked by other thread, we could cause a
> dead lock.
>
> For example:
>
>             FS 260        FS 261 (Dropped)
>             node A        node B
>            /       \     /       \
>      node C        node D        node E
>       /   \         /   \         /   \
>   leaf F|leaf G|leaf H|leaf I|leaf J|leaf K
>
> The lock sequence would be:
>
>        Thread A (cleaner)           |       Thread B (other writer)
> ----------------------------------------------------------------------
> write_lock(B)                       |
> write_lock(D)                       |
> ^^^ called by walk_down_tree()      |
>                                     | write_lock(A)
>                                     | write_lock(D) << Stall
> read_lock(H) << for backref walk    |
> read_lock(D) << lock owner is       |
>                 the same thread A   |
>                 so read lock is OK  |
> read_lock(A) << Stall               |
>
> So thread A holds write lock D and needs read lock A to make progress,
> while thread B holds write lock A and needs write lock D to make progress.
>
> This is a deadlock.
>
> This is not limited to the snapshot dropping case.
> The backref walk, even though it only happens on commit trees, breaks the
> normal top-down locking order, which makes it deadlock prone.
>
> Fixes: fb235dc06fac ("btrfs: qgroup: Move half of the qgroup accounting time out of commit trans")
> CC: stable@xxxxxxxxxxxxxxx # 4.19+
> Reported-and-tested-by: David Sterba <dsterba@xxxxxxxx>
> Reported-by: Filipe Manana <fdmanana@xxxxxxxx>
> Reviewed-by: Qu Wenruo <wqu@xxxxxxxx>
> Signed-off-by: Josef Bacik <josef@xxxxxxxxxxxxxx>
> [ copy logs and deadlock analysis from Qu's patch ]
> [ rebase to latest branch and fix lock assert bug ]
> Signed-off-by: David Sterba <dsterba@xxxxxxxx>
> ---
> changelog:
> v2:
> - Rebased to latest misc-next branch.
> - Fix a lock assert by moving btrfs_set_lock_blocking_read() into the
>   same branch as btrfs_tree_read_lock().
Thanks for finding the fix and sending it on Josef's behalf. If this
weren't an important bugfix I'd just wait for him to resend. I've added
your Signed-off-by as 3/4 of the changelog was copied from your patch
anyway.
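
For context, here is a minimal sketch of the locking pattern the fix
describes, based only on the commit message and the v2 changelog above:
lock the eb only when path->skip_locking is clear, and keep
btrfs_set_lock_blocking_read() in the same branch as
btrfs_tree_read_lock(). This is not the exact hunk from the patch, and the
surrounding context in fs/btrfs/backref.c may differ; the trans /
search_commit_root setup and the unlock call are shown only for
illustration and aren't quoted from the patch.

  /*
   * Caller side (already the case for the commit-root-only searches used
   * by the qgroup backref walk): nobody modifies the commit root while we
   * search it, so ask the walker to skip locking entirely.
   */
  if (!trans) {
          path->search_commit_root = 1;
          path->skip_locking = 1;
  }

  /*
   * Walker side: honor path->skip_locking instead of unconditionally
   * read-locking the eb, and set the lock blocking in the same branch so
   * we never assert on an eb that was never locked.
   */
  if (!path->skip_locking) {
          btrfs_tree_read_lock(eb);
          btrfs_set_lock_blocking_read(eb);
  }

  /* ... process the eb ... */

  if (!path->skip_locking)
          btrfs_tree_read_unlock_blocking(eb);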