On Sat, Aug 11, 2018 at 8:30 PM Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote:
>
> It looks pretty much like qgroup, but there is too much noise.
> The pinpoint trace event would be btrfs_find_all_roots().
I had this half-written when you replied.
Agreed: it looks like the bulk of the time is being spent in qgroups. I spent
some time with sysrq-l and ftrace:
? __rcu_read_unlock+0x5/0x50
? return_to_handler+0x15/0x36
__rcu_read_unlock+0x5/0x50
find_extent_buffer+0x47/0x90 extent_io.c:4888
read_block_for_search.isra.12+0xc8/0x350 ctree.c:2399
btrfs_search_slot+0x3e7/0x9c0 ctree.c:2837
btrfs_next_old_leaf+0x1dc/0x410 ctree.c:5702
btrfs_next_old_item ctree.h:2952
add_all_parents backref.c:487
resolve_indirect_refs+0x3f7/0x7e0 backref.c:575
find_parent_nodes+0x42d/0x1290 backref.c:1236
? find_parent_nodes+0x5/0x1290 backref.c:1114
btrfs_find_all_roots_safe+0x98/0x100 backref.c:1414
btrfs_find_all_roots+0x52/0x70 backref.c:1442
btrfs_qgroup_trace_extent_post+0x27/0x60 qgroup.c:1503
btrfs_qgroup_trace_leaf_items+0x104/0x130 qgroup.c:1589
btrfs_qgroup_trace_subtree+0x26a/0x3a0 qgroup.c:1750
do_walk_down+0x33c/0x5a0 extent-tree.c:8883
walk_down_tree+0xa8/0xd0 extent-tree.c:9041
btrfs_drop_snapshot+0x370/0x8b0 extent-tree.c:9203
merge_reloc_roots+0xcf/0x220
btrfs_recover_relocation+0x26d/0x400
? btrfs_cleanup_fs_roots+0x16a/0x180
btrfs_remount+0x32e/0x510
do_remount_sb+0x67/0x1e0
do_mount+0x712/0xc90
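(The backtrace above came from sysrq-l, i.e. "echo l > /proc/sysrq-trigger";
the file:line annotations were added after the fact, since sysrq itself only
prints symbol+offset.)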
The mount is looping in btrfs_qgroup_trace_subtree, as shown by ftrace with
the following filter:
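(For reference, the filter was populated through the standard tracefs
interface - the exact commands aren't captured here, but it amounts to
roughly:

cd /sys/kernel/tracing
echo btrfs_qgroup_trace_extent > set_ftrace_filter
echo btrfs_qgroup_trace_subtree >> set_ftrace_filter
echo function > current_tracer
echo 1 > tracing_on
)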
fileserver:/sys/kernel/tracing# cat set_ftrace_filter
btrfs_qgroup_trace_extent
btrfs_qgroup_trace_subtree
# cat trace
...
mount-6803 [003] .... 80407.649752: btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_subtree
mount-6803 [003] .... 80407.649772: btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
mount-6803 [003] .... 80407.649797: btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
mount-6803 [003] .... 80407.649821: btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
mount-6803 [003] .... 80407.649846: btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
mount-6803 [003] .... 80407.701652: btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
mount-6803 [003] .... 80407.754547: btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
mount-6803 [003] .... 80407.754574: btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
mount-6803 [003] .... 80407.754598: btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
mount-6803 [003] .... 80407.754622: btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
mount-6803 [003] .... 80407.754646: btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
... repeats 240 times
mount-6803 [002] .... 80412.568804: btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
mount-6803 [002] .... 80412.568825: btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
mount-6803 [002] .... 80412.568850: btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_subtree
mount-6803 [002] .... 80412.568872: btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
It looks like each invocation of btrfs_qgroup_trace_subtree is taking forever:
mount-6803 [006] .... 80641.627709: btrfs_qgroup_trace_subtree <-do_walk_down
mount-6803 [003] .... 81433.760945: btrfs_qgroup_trace_subtree <-do_walk_down
(add do_walk_down to the trace here)
mount-6803 [001] .... 82124.623557: do_walk_down <-walk_down_tree
mount-6803 [001] .... 82124.623567: btrfs_qgroup_trace_subtree <-do_walk_down
mount-6803 [006] .... 82695.241306: do_walk_down <-walk_down_tree
mount-6803 [006] .... 82695.241316: btrfs_qgroup_trace_subtree <-do_walk_down
So each do_walk_down/btrfs_qgroup_trace_subtree cycle is taking 10-13 minutes.
> 11T with highly deduped usage is really the worst-case scenario for qgroup.
> Qgroup is not really good at handling highly reflinked files, nor balance.
> When they combine, it gets worse.
I don't really understand the use case for qgroups if they melt down on
large filesystems with a shared base plus individual changes.
> I'll add a new rescue subcommand, 'btrfs rescue disable-quota' for you
> to disable quota offline.
Ok. I was looking at just doing this to speed things up:
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 51b5e2da708c..c5bf937b79f0 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8877,7 +8877,7 @@ static noinline int do_walk_down(struct btrfs_trans_handle *trans,
 			parent = 0;
 		}
-		if (need_account) {
+		if (0) {
 			ret = btrfs_qgroup_trace_subtree(trans, root, next,
 							 generation, level - 1);
 			if (ret) {
 				btrfs_err_rl(fs_info,
 					     "Error %d accounting shared subtree. Quota is out of sync, rescan required.",
 					     ret);
 			}
If I follow, this will leave me with inconsistent qgroups and require a full
rescan. That seems like an acceptable tradeoff, since the best plan going
forward is probably to nuke the qgroups anyway.
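For completeness, once the filesystem is mountable again, the cleanup should
just be the standard btrfs-progs commands (a sketch - the mount point below
is a placeholder):

btrfs quota disable /mnt/point        # drop quotas entirely
btrfs quota rescan -w /mnt/point      # or keep quotas and rebuild the
                                      # accounting, waiting for the rescan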
There's still the btrfs-transaction spin, but I'm hoping that's
related to qgroups as well.
>
> Thanks,
> Qu
Appreciate it. I was going to go with my hack-job patch to avoid any untested
rewriting - there's already an error path for "something went wrong updating
qgroups during the tree walk", so it seemed safest to take advantage of it.
I'll patch either the kernel or btrfs-progs, whichever you think is best.