On Thu, Sep 26, 2019 at 02:35:45PM +0800, Qu Wenruo wrote:
> [BUG]
> There is one BUG_ON() report where a transaction is aborted during
> balance, then kernel BUG_ON() in merge_reloc_roots():
>
> void merge_reloc_roots(struct reloc_control *rc)
> {
> ...
> BUG_ON(!RB_EMPTY_ROOT(&rc->reloc_root_tree.rb_root)); <<<
> }
>
> [CAUSE]
> It's still uncertain why we can get to such situation.
> As all __add_reloc_root() calls will also link that reloc root to
> rc->reloc_roots, and in merge_reloc_roots() we cleanup rc->reloc_roots.
>
> So the root cause is still uncertain.
>
> [FIX]
> But we can still hunt down all the BUG_ON() in merge_reloc_roots().
>
> There are 3 BUG_ON()s in it:
> - BUG_ON() for read_fs_root() result
> - BUG_ON() for root->reloc_root != reloc_root case
> - BUG_ON() for the non-empty reloc_root_tree
>
> For the first two, just grab the return value and goto out tag for
> cleanup.
>
> For the last one, make it more graceful by:
> - grab the lock before doing read/write
> - warn instead of panic
> - cleanup the nodes in rc->reloc_root_tree
>
> Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
If we're going to do these things we might as well add the ability to error
inject and add an xfstest to verify they don't break anything. The
BUG_ON(root->reloc_root != reloc_root) isn't able to be tested, but you can at
least make read_fs_root() and merge_reloc_root() and then test this patch by
triggering errors in these paths.
If you do that I'm ok with not being able to explain the leaking nodes, getting
rid of the BUG_ON()'s is good enough reason. Thanks,
Josef