Re: [PATCH v4 0/7] btrfs: qgroup: Delay subtree scan to reduce overhead

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 2019/1/16 上午9:15, Josef Bacik wrote:
> On Wed, Jan 16, 2019 at 09:07:26AM +0800, Qu Wenruo wrote:
>>
>>
>> On 2019/1/16 上午8:31, Qu Wenruo wrote:
>>>
>>>
>>> On 2019/1/16 上午1:26, Josef Bacik wrote:
>>>> On Tue, Jan 15, 2019 at 04:15:57PM +0800, Qu Wenruo wrote:
>>>>> This patchset can be fetched from github:
>>>>> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree
>>>>>
>>>>> Which is based on v5.0-rc1.
>>>>>
>>>>
>>>> I've been testing these patches hoping to get rid of the qgroup deadlock that
>>>> these patches are supposed to fix, but instead they make the box completely
>>>> unusable with 100% cpu usage for minutes at a time at every transaction commit.
>>>
>>> I'm afraid it's related to the v5.0-rc1 base, not the patchset itself.
>>>
>>> Just try to balance metadata with 16 snapshots, you'll see btrfs bumping
>>> its generation like crazy, no matter if quota is enabled or not.
>>>
>>> And since btrfs is committing transaction like crazy, no wonder it will
>>> do tons of qgroup accounting.
>>>
>>> My bisect leads to commit 64403612b73a94bc7b02cf8ca126e3b8ced6e921
>>> btrfs: rework btrfs_check_space_for_delayed_refs.
>>
>> Furthermore, you could try this RFC test case to see the problem.
>> https://patchwork.kernel.org/patch/10761715/
>>
>> It would only take around 100s for v4.20 but over 500 for v5.0-rc1 with
>> tons of unnecessary transaction committed for nothing, no quota enabled.
>>
>> So I'm afraid that commit is blocking my qgroup patchset.
>>
> 
> I've fixed the balance problem, it took 2 seconds to figure out, I'm just
> waiting on xfstests to finish running.
> 
> And your patch making things worse has nothing to do with that problem.  Our
> test doesn't run balance, so the issue you reported has nothing to do with the
> fact that your patch makes our boxes unusable with qgroups on.
> 
> The problem is with your deadlock avoidence patch we're now making a lot more
> dirty extents to run in the critical section of the transaction commit.  Also
> because we're no longer pre-fetching the old roots, we're doing the old roots
> and new roots lookup inside the critical section, so now each dirty extent takes
> 2x as long.  With my basic test it was taking 5 minutes to do the qgroup
> accounting, which keeps the box from booting essentially.
> 
> Without your patch it's still awful because btrfs-cleaner just sits there at
> 100% while deleting snapshots, but at least it's not making the whole system
> stop running while it does all that work in the transaction commit.
> 
> And if you had done the proper root cause analysis you would have noticed that
> we're taking tree locks in the find_parent_nodes() case when we're searching the
> commit root, something we shouldn't be doing.  So all that really needs to be
> done is to check if (!path->skip_locking) btrfs_tree_read_lock(eb); in those
> cases and the deadlock goes away.  Because no matter what we shouldn't be taking
> locks when we're not given a trans in the backref lookup code.

That indeed looks much better than my current solution.

Although I'm not 100% sure for cases like tree blocks shared between
commit and current root (tree block not modified yet).

I'll definitely invest more time to try to fix this bug.

Thanks,
Qu

>  So the fact that
> we were doing just that and thus deadlocking should have been a red flag.
> 
> I will be sending these patches in the morning, once all of the various testing
> that should take place on patches is complete.  The balance patches you have for
> qgroups don't appear to be a problem, but that deadlock one is bogus and needs
> to be dropped.  Thanks,
> 
> Josef
> 

Attachment: signature.asc
Description: OpenPGP digital signature


[Index of Archives]     [Linux Filesystem Development]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux