On 2018/12/21 12:12 AM, Jeff Mahoney wrote:
> On 12/11/18 12:02 AM, Qu Wenruo wrote:
>> [BUG]
>> Since fb235dc06fac ("btrfs: qgroup: Move half of the qgroup accounting
>> time out of commit trans"), the kernel may lock up with quota enabled.
>>
>> The lockup is triggered by snapshot dropping combined with write
>> operations in the source subvolume, and can be reliably reproduced.
>>
>> btrfs-cleaner D 0 4062 2 0x80000000
>> Call Trace:
>> schedule+0x32/0x90
>> btrfs_tree_read_lock+0x93/0x130 [btrfs]
>> find_parent_nodes+0x29b/0x1170 [btrfs]
>> btrfs_find_all_roots_safe+0xa8/0x120 [btrfs]
>> btrfs_find_all_roots+0x57/0x70 [btrfs]
>> btrfs_qgroup_trace_extent_post+0x37/0x70 [btrfs]
>> btrfs_qgroup_trace_leaf_items+0x10b/0x140 [btrfs]
>> btrfs_qgroup_trace_subtree+0xc8/0xe0 [btrfs]
>> do_walk_down+0x541/0x5e3 [btrfs]
>> walk_down_tree+0xab/0xe7 [btrfs]
>> btrfs_drop_snapshot+0x356/0x71a [btrfs]
>> btrfs_clean_one_deleted_snapshot+0xb8/0xf0 [btrfs]
>> cleaner_kthread+0x12b/0x160 [btrfs]
>> kthread+0x112/0x130
>> ret_from_fork+0x27/0x50
>>
>> [CAUSE]
>> When dropping a snapshot with qgroup enabled, we trigger a backref
>> walk.
>>
>> However, a backref walk at that time is dangerous: if one of the
>> parent nodes gets write-locked by another thread, we can deadlock.
>>
>> For example:
>>
>>            FS 260     FS 261 (Dropped)
>>             node A        node B
>>            /      \      /      \
>>        node C    node D    node E
>>       /   \      /   \     /   \
>>  leaf F|leaf G|leaf H|leaf I|leaf J|leaf K
>>
>> The lock sequence would be:
>>
>> Thread A (cleaner)                 | Thread B (other writer)
>> -----------------------------------------------------------------------
>> write_lock(B)                      |
>> write_lock(D)                      |
>> ^^^ called by walk_down_tree()     |
>>                                    | write_lock(A)
>>                                    | write_lock(D) << Stall
>> read_lock(H) << for backref walk   |
>> read_lock(D) << lock owner is      |
>>    the same thread A,              |
>>    so the read lock is OK          |
>> read_lock(A) << Stall              |
>>
>> So thread A holds write lock D and needs read lock A to continue,
>> while thread B holds write lock A and needs write lock D to continue.
>>
>> This causes a deadlock.
>>
>> This is not limited to the snapshot dropping case.
>> The backref walk, even though it only happens on commit trees, breaks
>> the normal top-down locking order, which makes it deadlock prone.
>>
>> [FIX]
>> The solution is to never trigger a backref walk while holding any
>> write lock. This means we can't do the backref walk at delayed ref
>> insert time.
>>
>> So this patch will mostly revert commit fb235dc06fac ("btrfs: qgroup:
>> Move half of the qgroup accounting time out of commit trans").
>>
>> Reported-by: David Sterba <dsterba@xxxxxxx>
>> Reported-by: Filipe Manana <fdmanana@xxxxxxxx>
>> Fixes: fb235dc06fac ("btrfs: qgroup: Move half of the qgroup accounting time out of commit trans")
>> Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
>> ---
>> fs/btrfs/delayed-ref.c | 6 ------
>> fs/btrfs/qgroup.c | 35 ++---------------------------------
>> 2 files changed, 2 insertions(+), 39 deletions(-)
>>
>> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
>> index 9301b3ad9217..7aaae8436986 100644
>> --- a/fs/btrfs/delayed-ref.c
>> +++ b/fs/btrfs/delayed-ref.c
>> @@ -788,9 +788,6 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
>> if (ret > 0)
>> kmem_cache_free(btrfs_delayed_tree_ref_cachep, ref);
>>
>> - if (qrecord_inserted)
>> - btrfs_qgroup_trace_extent_post(fs_info, record);
>> -
>> return 0;
>> }
>>
>> @@ -869,9 +866,6 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
>> if (ret > 0)
>> kmem_cache_free(btrfs_delayed_data_ref_cachep, ref);
>>
>> -
>> - if (qrecord_inserted)
>> - return btrfs_qgroup_trace_extent_post(fs_info, record);
>> return 0;
>> }
>
>
> This looks good but is leaving some cruft behind.
>
> These were the only uses for qrecord_inserted so we don't need it in
> either function and we don't need to pass it to add_delayed_ref_head.
This version is a little old; the latest one can be found titled:
[PATCH v3 2/7] btrfs: qgroup: Don't trigger backref walk at delayed ref
insert time
Or at patchwork:
https://patchwork.kernel.org/patch/10725371/
Thanks,
Qu
>
> -Jeff
>
>> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
>> index 45868fd76209..7d485b1e0efb 100644
>> --- a/fs/btrfs/qgroup.c
>> +++ b/fs/btrfs/qgroup.c
>> @@ -1553,33 +1553,6 @@ int btrfs_qgroup_trace_extent_nolock(struct btrfs_fs_info *fs_info,
>> return 0;
>> }
>>
>> -int btrfs_qgroup_trace_extent_post(struct btrfs_fs_info *fs_info,
>> - struct btrfs_qgroup_extent_record *qrecord)
>> -{
>> - struct ulist *old_root;
>> - u64 bytenr = qrecord->bytenr;
>> - int ret;
>> -
>> - ret = btrfs_find_all_roots(NULL, fs_info, bytenr, 0, &old_root, false);
>> - if (ret < 0) {
>> - fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
>> - btrfs_warn(fs_info,
>> -"error accounting new delayed refs extent (err code: %d), quota inconsistent",
>> - ret);
>> - return 0;
>> - }
>> -
>> - /*
>> - * Here we don't need to get the lock of
>> - * trans->transaction->delayed_refs, since inserted qrecord won't
>> - * be deleted, only qrecord->node may be modified (new qrecord insert)
>> - *
>> - * So modifying qrecord->old_roots is safe here
>> - */
>> - qrecord->old_roots = old_root;
>> - return 0;
>> -}
>> -
>> int btrfs_qgroup_trace_extent(struct btrfs_trans_handle *trans, u64 bytenr,
>> u64 num_bytes, gfp_t gfp_flag)
>> {
>> @@ -1607,7 +1580,7 @@ int btrfs_qgroup_trace_extent(struct btrfs_trans_handle *trans, u64 bytenr,
>> kfree(record);
>> return 0;
>> }
>> - return btrfs_qgroup_trace_extent_post(fs_info, record);
>> + return 0;
>> }
>>
>> int btrfs_qgroup_trace_leaf_items(struct btrfs_trans_handle *trans,
>> @@ -2557,11 +2530,7 @@ int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans)
>> trace_btrfs_qgroup_account_extents(fs_info, record);
>>
>> if (!ret) {
>> - /*
>> - * Old roots should be searched when inserting qgroup
>> - * extent record
>> - */
>> - if (WARN_ON(!record->old_roots)) {
>> + if (!record->old_roots) {
>> /* Search commit root to find old_roots */
>> ret = btrfs_find_all_roots(NULL, fs_info,
>> record->bytenr, 0,
>>
>