On Wed, May 22, 2019 at 04:33:11PM +0800, Qu Wenruo wrote:
> [BUG]
> When a fs has orphan reloc tree along with unfinished balance:
> ...
> item 16 key (TREE_RELOC ROOT_ITEM FS_TREE) itemoff 12090 itemsize 439
> generation 12 root_dirid 256 bytenr 300400640 level 1 refs 0 <<<
> lastsnap 8 byte_limit 0 bytes_used 1359872 flags 0x0(none)
> uuid 7c48d938-33a3-4aae-ab19-6e5c9d406e46
> item 17 key (BALANCE TEMPORARY_ITEM 0) itemoff 11642 itemsize 448
> temporary item objectid BALANCE offset 0
> balance status flags 14
>
> Then at mount time, we can hit the following kernel BUG_ON():
> BTRFS info (device dm-3): relocating block group 298844160 flags metadata|dup
> ------------[ cut here ]------------
> kernel BUG at fs/btrfs/relocation.c:1413!
> invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> CPU: 1 PID: 897 Comm: btrfs-balance Tainted: G O 5.2.0-rc1-custom #15
> RIP: 0010:create_reloc_root+0x1eb/0x200 [btrfs]
> Call Trace:
> btrfs_init_reloc_root+0x96/0xb0 [btrfs]
> record_root_in_trans+0xb2/0xe0 [btrfs]
> btrfs_record_root_in_trans+0x55/0x70 [btrfs]
> select_reloc_root+0x7e/0x230 [btrfs]
> do_relocation+0xc4/0x620 [btrfs]
> relocate_tree_blocks+0x592/0x6a0 [btrfs]
> relocate_block_group+0x47b/0x5d0 [btrfs]
> btrfs_relocate_block_group+0x183/0x2f0 [btrfs]
> btrfs_relocate_chunk+0x4e/0xe0 [btrfs]
> btrfs_balance+0x864/0xfa0 [btrfs]
> balance_kthread+0x3b/0x50 [btrfs]
> kthread+0x123/0x140
> ret_from_fork+0x27/0x50
>
> [CAUSE]
> In btrfs, reloc trees are used to record swapped tree blocks during
> balance.
> Reloc tree either get merged (replace old tree blocks of its parent
> subvolume) in next transaction if its ref is 1 (fresh).
> Or is already merged and will be cleaned up if its ref is 0 (orphan).
>
> After commit d2311e698578 ("btrfs: relocation: Delay reloc tree deletion
> after merge_reloc_roots"), reloc tree cleanup is delayed until one block
> group is balanced.
>
> Since fresh reloc roots are recorded during merge, as long as there
> is no power loss, those orphan reloc roots converted from fresh ones are
> handled without problem.
>
> However when power loss happens, orphan reloc roots can be recorded
> on-disk, thus at next mount time, we will have orphan reloc roots from
> on-disk data directly, and ignored by clean_dirty_subvols() routine.
>
> Then when background balance starts to balance another block group, and
> needs to create new reloc root for the same root, btrfs_insert_item()
> returns -EEXIST, and trigger that BUG_ON().
>
> [FIX]
> For orphan reloc roots, also queue them to rc->dirty_subvol_roots, so
> all reloc roots no matter orphan or not, can be cleaned up properly and
> avoid above BUG_ON().
>
> And to cooperate with above change, clean_dirty_subvols() will check if
> the queued root is a reloc root or a subvol root.
> For a subvol root, do the old work, and for a orphan reloc root, clean it
> up.
>
> Fixes: d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
> Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
I've hit that BUG_ON too on my work machine too. Thanks for the fix,
queued for 5.2-rc.