At 02/08/2017 09:56 PM, Filipe Manana wrote:
On Wed, Feb 8, 2017 at 12:39 AM, Qu Wenruo <quwenruo@xxxxxxxxxxxxxx> wrote:
At 02/07/2017 11:55 PM, Filipe Manana wrote:
On Tue, Feb 7, 2017 at 12:22 AM, Qu Wenruo <quwenruo@xxxxxxxxxxxxxx>
wrote:
At 02/07/2017 12:09 AM, Goldwyn Rodrigues wrote:
Hi Qu,
On 02/05/2017 07:45 PM, Qu Wenruo wrote:
At 02/04/2017 09:47 AM, Jorg Bornschein wrote:
February 4, 2017 1:07 AM, "Goldwyn Rodrigues" <rgoldwyn@xxxxxxx>
wrote:
<snipped>
Quota support was indeed active -- and it warned me that the qgroup
data was inconsistent.
Disabling quotas had an immediate impact on balance throughput -- it's
*much* faster now!
From a quick glance at iostat I would guess it's at least a factor of 100
faster.
Should quota support generally be disabled during balances? Or did I
somehow push my fs into a weird state where it triggered a slow path?
Thanks!
j
Would you please provide the kernel version?
v4.9 introduced a bad fix for qgroup balance, one which not only fails
to completely fix the qgroup byte leaking, but also hugely slows down
the balance process:
commit 62b99540a1d91e46422f0e04de50fc723812c421
Author: Qu Wenruo <quwenruo@xxxxxxxxxxxxxx>
Date: Mon Aug 15 10:36:51 2016 +0800
btrfs: relocation: Fix leaking qgroups numbers on data extents
Sorry for that.
And in v4.10 a better method is applied to fix the byte leaking
problem; it should also be a little faster than the previous one.
commit 824d8dff8846533c9f1f9b1eabb0c03959e989ca
Author: Qu Wenruo <quwenruo@xxxxxxxxxxxxxx>
Date: Tue Oct 18 09:31:29 2016 +0800
btrfs: qgroup: Fix qgroup data leaking by using subtree tracing
However, balance with qgroups is still slower than balance without
them; the real fix requires reworking the current backref iteration.
This patch has made btrfs balance performance worse. The balance
task has become more CPU intensive than before and takes longer
to complete, besides hogging resources. While correctness is important,
we need to figure out how this can be made more efficient.
The cause is already known.
It's find_parent_nodes(), which takes most of the time finding all
referencers of an extent.
And it's also the cause of the FIEMAP soft lockup (fixed in a recent
release by quitting early).
The biggest problem is that the current find_parent_nodes() iterates
using a list, which is quite slow, especially since it's called inside a loop.
In the real world find_parent_nodes() is about O(n^3).
We can either improve find_parent_nodes() by using an rb-tree, or
introduce some caching for it.
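To illustrate the difference (a minimal hypothetical sketch, not the
actual fs/btrfs/backref.c code; the struct and function names here are
made up for the example):

    /*
     * List-based collection: every insert scans the whole list to find
     * a duplicate to merge with, so collecting N refs costs O(N^2)
     * just for the deduplication. This mirrors the shape of the
     * problem, not the real prelim ref handling.
     */
    struct sketch_ref {
            struct list_head list;
            u64 root_id;
            u64 parent;
            int count;
    };

    static void sketch_add_ref(struct list_head *head,
                               struct sketch_ref *newref)
    {
            struct sketch_ref *ref;

            /* O(N) duplicate scan on every single insert */
            list_for_each_entry(ref, head, list) {
                    if (ref->root_id == newref->root_id &&
                        ref->parent == newref->parent) {
                            ref->count += newref->count; /* merge dup */
                            kfree(newref);
                            return;
                    }
            }
            list_add_tail(&newref->list, head);
    }

With an rb-tree keyed on (root_id, parent) the duplicate lookup drops
to O(log N) per insert, and since find_parent_nodes() is itself called
in a loop, shaving a factor of N there matters a lot.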
Even if anyone is able to reduce that function's complexity from
O(n^3) down to, let's say, O(n^2) or O(n log n) for example, the current
implementation of qgroups will always be a problem. The real problem
is that this more recent rework of qgroups does all this accounting
inside the critical section of a transaction - blocking any other
tasks that want to start a new transaction or attempt to join the
current transaction. Not to mention that on systems with small amounts
of memory (2GB or 4GB from what I've seen in user reports) we also
OOM due to the allocation of one struct btrfs_qgroup_extent_record per
delayed data reference head, used for that accounting phase
in the critical section of a transaction commit.
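For reference, the record in question is small, but one is allocated
for every data delayed ref head and only released by the commit-time
accounting (this is the v4.9-era definition from fs/btrfs/qgroup.h,
reproduced here for context):

    struct btrfs_qgroup_extent_record {
            struct rb_node node;    /* in delayed_refs->dirty_extent_root */
            u64 bytenr;
            u64 num_bytes;
            struct ulist *old_roots;
    };

So it's not the size of each record that hurts; it's that they can
only accumulate until the commit finally processes them.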
Let's face it and be realistic: even if someone manages to make
find_parent_nodes() much, much better - O(n) for example - it will
always be a problem for the reasons mentioned before. Many extents
touched per transaction and many subvolumes/snapshots will always
expose the root problem: doing the accounting in the transaction
commit critical section.
You must accept the fact that we must call find_parent_nodes() at least twice
to track the ownership change correctly for each touched extent,
or the qgroup numbers will never be correct.
One call gets old_roots by searching the commit root, and one gets
new_roots by searching the current root.
You can call find_parent_nodes() as many times as you like, but that's just
wasting your CPU time.
Only the final find_parent_nodes() call determines new_roots for that extent,
and there is no better time for it than commit_transaction().
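Condensed, the per-extent accounting I'm describing looks like this
(simplified from the v4.9-era fs/btrfs/qgroup.c; locking and error
handling dropped):

    struct ulist *old_roots = NULL, *new_roots = NULL;

    /*
     * First walk: search the commit root for the set of roots that
     * referenced the extent before this transaction.
     */
    btrfs_find_all_roots(NULL, fs_info, record->bytenr, 0, &old_roots);

    /*
     * Second walk, inside the commit critical section and before the
     * commit roots are switched: the set of roots referencing the
     * extent now.
     */
    btrfs_find_all_roots(trans, fs_info, record->bytenr, (u64)-1,
                         &new_roots);

    /* update rfer/excl for every root that appeared or disappeared */
    btrfs_qgroup_account_extent(trans, fs_info, record->bytenr,
                                record->num_bytes, old_roots, new_roots);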
You're missing my point.
My point is not about needing to call find_parent_nodes(), nor how many
times to call it, or whether it's needed or not. My point is about
doing expensive things inside the critical section of a transaction
commit, which leads not only to low performance but to the system
becoming unresponsive with very high latency - and this is not
theory or speculation; there are upstream reports about this, as well
as several in SUSE's bugzilla, all caused when qgroups are enabled on
4.2+ kernels (when the last major qgroup changes landed).
Judging from that code and from your reply to this and other threads
it seems you didn't understand the consequences of doing all that
accounting stuff inside the critical section of a transaction commit.
No, I know what you're talking about.
Otherwise I wouldn't have sent the patch to move half of the
find_all_roots() calls out of commit_trans().
(OK, I just use commit_trans() to refer to its critical section, since
the quick exit and other parts don't make much impact in this context.)
Meanwhile it seems that you still don't understand the necessity of
calling find_all_roots() in the critical section.
It's the cornerstone of qgroups; without it we go back to the qgroup mismatch days.
As I already mentioned, it is only before we switch the fs commit roots
that we can get correct new_roots without extra expensive find_all_roots() calls.
That is to say, at least one find_all_roots() call per modified extent
must happen in the critical section to keep the qgroup numbers correct.
If you have any better solution that keeps qgroups correct while avoiding
find_all_roots() calls in the critical section, I'm happy to listen.
Forget find_parent_nodes() for a moment; yes, it has problems, but it's
not what I'm trying to make you understand. Just look at
btrfs_qgroup_account_extents(), called within the critical section -
it iterates over all elements of a red-black tree, and each element
corresponds to some data extent allocated in the current transaction.
If we have thousands, tens of thousands, or more, even if whatever the
loop calls had an awesome complexity of O(1) or O(log N), it would
still be bad, exactly because it blocks future transactions from
starting and tasks from joining the current transaction. CPU time and
memory consumption (used for those struct btrfs_qgroup_extent_record)
are also concerns, but to a smaller extent imo.
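Condensed, the loop I'm talking about looks roughly like this (from
the v4.9-era btrfs_qgroup_account_extents(); details and error
handling omitted) - and remember it runs while every other task is
blocked from joining or starting a transaction:

    while ((node = rb_first(&delayed_refs->dirty_extent_root))) {
            record = rb_entry(node, struct btrfs_qgroup_extent_record,
                              node);
            /* an expensive backref walk for every dirty extent */
            btrfs_find_all_roots(trans, fs_info, record->bytenr,
                                 (u64)-1, &new_roots);
            btrfs_qgroup_account_extent(trans, fs_info, record->bytenr,
                                        record->num_bytes,
                                        record->old_roots, new_roots);
            rb_erase(node, &delayed_refs->dirty_extent_root);
            ulist_free(record->old_roots);
            kfree(record);
    }

One rb-tree node, and at least one full backref walk, per data extent
touched in the transaction.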
That's another problem.
In the worst case, we could limit the number of delayed ref heads before
triggering a transaction commit.
But I still doubt that the qgroup extent records are the cause of the OOM.
The memory consumption of a qgroup_extent_record is quite small: only 48
bytes per extent.
(And a delayed ref head is much larger than a qgroup_extent_record.)
And since an extent is at least 4K, the overhead would be at most about 1%.
IIRC, before we hit OOM, memory pressure should trigger
commit_transaction(), freeing all the qgroup_extent_records.
So there must be some other problem involved here.
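To put rough numbers on that overhead claim, using the 48-byte figure above:

    48 bytes/record / 4096 bytes per 4K extent ~= 1.2% worst case
    e.g. one million dirty 4K extents (~4 GB of data) -> ~48 MB of
         extent records, on top of the much larger delayed ref heads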
Or you can waste more time calling find_parent_nodes() every time you
touch an extent, saving one find_parent_nodes() call in commit_transaction() at
the cost of more find_parent_nodes() calls elsewhere.
Is that what you want?
I can move the find_parent_nodes() call for old_roots out of commit_transaction(),
but that will only cut the time spent in commit_transaction() by 50%.
Compared to the O(n^3) find_parent_nodes(), that's not even the determining factor.
Thanks,
Qu
IIRC the SUSE folks (maybe Jeff?) were working on the first method, but I
haven't heard anything about it recently.
Thanks,
Qu