I think I'm also running into this kind of balancing loop.
I'm seeing thousands of these repeating messages in syslog:

Apr 29 11:56:49 rockpro64 kernel: [12044.645247] BTRFS info (device dm-3): found 36652 extents, stage: move data extents
Apr 29 11:56:49 rockpro64 kernel: [12045.134649] BTRFS info (device dm-3): found 36652 extents, stage: move data extents
Apr 29 11:56:50 rockpro64 kernel: [12045.648773] BTRFS info (device dm-3): found 36652 extents, stage: move data extents
Apr 29 11:56:50 rockpro64 kernel: [12046.172767] BTRFS info (device dm-3): found 36652 extents, stage: move data extents
Apr 29 11:56:51 rockpro64 kernel: [12046.666079] BTRFS info (device dm-3): found 36652 extents, stage: move data extents
Apr 29 11:56:51 rockpro64 kernel: [12047.199652] BTRFS info (device dm-3): found 36652 extents, stage: move data extents

Can I help you to debug this in any way? I'm not too familiar with the
internal btrfs tools.

Best regards,
Sebastian

On Wed, 29 Apr 2020 at 07:36, Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote:
>
>
>
> On 2020/4/28 10:51 PM, Zygo Blaxell wrote:
> > On Tue, Apr 28, 2020 at 05:54:21PM +0800, Qu Wenruo wrote:
> >>
> >>
> >> On 2020/4/28 12:55 PM, Zygo Blaxell wrote:
> >>> On Mon, Apr 27, 2020 at 03:07:29PM +0800, Qu Wenruo wrote:
> >>>>
> >>>>
> >>>> On 2020/4/12 5:14 AM, Zygo Blaxell wrote:
> >>>>> Since 5.1, btrfs has been prone to getting stuck in semi-infinite loops
> >>>>> in balance and device shrink/remove:
> >>>>>
> >>>>> [Sat Apr 11 16:59:32 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> >>>>> [Sat Apr 11 16:59:33 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> >>>>> [Sat Apr 11 16:59:34 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> >>>>> [Sat Apr 11 16:59:34 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> >>>>> [Sat Apr 11 16:59:35 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> >>>>>
> >>>>> This is a block group while it's looping, as seen by python-btrfs:
> >>>>>
> >>>>> # share/python-btrfs/examples/show_block_group_contents.py 1934913175552 /media/testfs/
> >>>>> block group vaddr 1934913175552 length 1073741824 flags DATA used 939167744 used_pct 87
> >> [...]
> >>>>>
> >>>>> All of the extent data backrefs are removed by the balance, but the
> >>>>> loop keeps trying to get rid of the shared data backrefs. It has
> >>>>> no effect on them, but keeps trying anyway.
> >>>>
> >>>> I guess this gives a pretty good clue.
> >>>>
> >>>> I was always thinking about the reloc tree, but in your case, it's the data
> >>>> reloc tree owning them.
> >>>
> >>> In that case, yes. Metadata balances loop too, in the "move data extents"
> >>> stage, while data balances loop in the "update data pointers" stage.
> >>
> >> Would you please take an image dump of the fs when the runaway balance happens?
> >>
> >> Dumps of both a looping metadata and a looping data block group would greatly help.
> >
> > There's two problems with this:
> >
> > 1) my smallest test filesystems have 29GB of metadata,
> >
> > 2) the problem is not reproducible with an image.
>
> OK, then when such a loop happens, what we need is:
> - Which block group is looping
>   Obviously
>
> - The extent tree dump of that block group
>   You have done a pretty good job on that.
>
> - Follow the backref to reach the root
>   In the balance case, the offending backref should always be using the parent
>   bytenr, so you need to go step by step to reach the root, which should
>   have a backref pointing back to itself.
>
> - Root tree dump of the looping fs.
>   To find out which reloc tree it belongs to.
>   And show the ref number of that reloc tree.
>
> > I've tried using VM snapshots to put a filesystem into a reproducible
> > looping state. A block group that loops on one boot doesn't repeatably
> > loop on another boot from the same initial state; however, once a
> > booted system starts looping, it continues to loop even if the balance
> > is cancelled and I restart the balance on the same block group or on
> > other random block groups.
> >
> > I have production filesystems with tens of thousands of block groups
> > and almost all of them loop (as I said before, I cannot complete any
> > RAID reshapes with 5.1+ kernels). They can't _all_ be bad.
> >
> > Cancelling a balance (usually) doesn't recover. Rebooting does.
> > The change that triggered this changes the order of operations in
> > the kernel. That smells like a runtime thing to me.
> >
> >>>> In that case, the data reloc tree is only cleaned up at the end of
> >>>> btrfs_relocate_block_group().
> >>>>
> >>>> Thus it is never cleaned up until we exit the balance loop.
> >>>>
> >>>> I'm not sure why this is happening only after I extended the lifespan of
> >>>> the reloc tree (not the data reloc tree).
> >>>
> >>> I have been poking around with printk to trace what it's doing in the
> >>> looping and non-looping cases. It seems to be very similar up to
> >>> calling merge_reloc_root, merge_reloc_roots, unset_reloc_control,
> >>> btrfs_block_rsv_release, btrfs_commit_transaction, clean_dirty_subvols,
> >>> btrfs_free_block_rsv. In the looping cases, everything up to those
> >>> functions seems the same on every loop except the first one.
> >>>
> >>> In the non-looping cases, those functions do something different than
> >>> in the looping cases: the extents disappear in the next loop, and the
> >>> balance finishes.
> >>>
> >>> I haven't figured out _what_ is different yet. I need more cycles to
> >>> look at it.
> >>>
> >>> Your extend-the-lifespan-of-reloc-tree patch moves one of the
> >>> functions--clean_dirty_subvols (or btrfs_drop_snapshot)--to a different
> >>> place in the call sequence. It was in merge_reloc_roots before the
> >>> transaction commit; now it's in relocate_block_group after the transaction
> >>> commit. My guess is that the problem lies somewhere in how the behavior
> >>> of these functions has been changed by calling them in a different
> >>> sequence.
> >>>
> >>>> But anyway, would you like to give the following patch a try?
> >>>> https://patchwork.kernel.org/patch/11511241/
> >>>
> >>> I'm not sure how this patch could work. We are hitting the found_extents
> >>> counter every time through the loop. It's returning thousands of extents
> >>> each time.
> >>>
> >>>> It should make us exit the balance so long as we have no extra
> >>>> extent to relocate.
> >>>
> >>> The problem is not that we have no extents to relocate. The problem is
> >>> that we don't successfully get rid of the extents we do find, so we keep
> >>> finding them over and over again.
> >>
> >> That's very strange.
> >>
> >> As you can see, for relocate_block_group(), it will clean up the reloc trees.
> >>
> >> This means either we have reloc trees in use and not cleaned up, or some
> >> tracing mechanism is not working properly.
> >
> > Can you point out where in the kernel that happens? If we throw some
> > printks at it we might see something.
>
> The merge happens at merge_reloc_roots().
> The cleanup happens at clean_dirty_subvols().
>
> In theory, all reloc trees will either be marked as orphan (due to new
> changes in the original subvolume) or merged in merge_reloc_roots(), and
> queued for cleanup.
> Then in clean_dirty_subvols() the reloc trees all get cleaned up.
>
> If there are any reloc trees remaining after clean_dirty_subvols(),
> then they could be the cause of such a loop.
>
> Thanks,
> Qu
>
> >
> >> Anyway, an image dump with the dead-looping block group specified would
> >> provide a good hint for this long-standing problem.
> >>
> >> Thanks,
> >> Qu
> >>
> >>>
> >>> In testing, the patch has no effect:
> >>>
> >>> [Mon Apr 27 23:36:15 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>> [Mon Apr 27 23:36:21 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>> [Mon Apr 27 23:36:27 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>> [Mon Apr 27 23:36:32 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>> [Mon Apr 27 23:36:38 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>> [Mon Apr 27 23:36:44 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>> [Mon Apr 27 23:36:50 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>> [Mon Apr 27 23:36:56 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>> [Mon Apr 27 23:37:01 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>> [Mon Apr 27 23:37:07 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>> [Mon Apr 27 23:37:13 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>> [Mon Apr 27 23:37:19 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>> [Mon Apr 27 23:37:24 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>> [Mon Apr 27 23:37:30 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>>
> >>> The above is the tail end of 3320 loops on a single block group.
> >>>
> >>> I switched to a metadata block group and it's on the 9th loop:
> >>>
> >>> # btrfs balance start -mconvert=raid1 /media/testfs/
> >>> [Tue Apr 28 00:09:47 2020] BTRFS info (device dm-0): found 34977 extents, stage: move data extents
> >>> [Tue Apr 28 00:12:24 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> >>> [Tue Apr 28 00:18:46 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> >>> [Tue Apr 28 00:23:24 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> >>> [Tue Apr 28 00:25:54 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> >>> [Tue Apr 28 00:28:17 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> >>> [Tue Apr 28 00:30:35 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> >>> [Tue Apr 28 00:32:45 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> >>> [Tue Apr 28 00:37:01 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> >>>
> >>>
> >>>> Thanks,
> >>>> Qu
> >>>>
> >>>>>
> >>>>> This is "semi-infinite" because it is possible for the balance to
> >>>>> terminate if something removes those 29 extents (e.g. looking up the
> >>>>> extent vaddrs with 'btrfs ins log' and then feeding the references to
> >>>>> 'btrfs fi defrag' will reduce the number of inline shared data backref
> >>>>> objects). When it's reduced all the way to zero, balance starts up again,
> >>>>> usually promptly getting stuck on the very next block group. If the _only_
> >>>>> thing running on the filesystem is balance, it will not stop looping.
> >>>>>
> >>>>> Bisection points to commit d2311e698578 "btrfs: relocation: Delay reloc
> >>>>> tree deletion after merge_reloc_roots" as the first commit where the
> >>>>> balance loops can be reproduced.
> >>>>>
> >>>>> I tested with commit 59b2c371052c "btrfs: check commit root generation
> >>>>> in should_ignore_root" as well as the rest of misc-next, but the balance
> >>>>> loops are still easier to reproduce than to avoid.
> >>>>>
> >>>>> Once it starts happening on a filesystem, it seems to happen very
> >>>>> frequently. It is not possible to reshape a RAID array of more than a
> >>>>> few hundred GB on kernels after 5.0. I can get maybe 50-100 block groups
> >>>>> completed in a resize or balance after a fresh boot, then balance gets
> >>>>> stuck in loops after that. With the fast balance cancel patches it's
> >>>>> possible to recover from the loop, but futile, since the next balance
> >>>>> will almost always also loop, even if it is passed a different block
> >>>>> group. I've had to downgrade to 5.0 or 4.19 to complete any RAID
> >>>>> reshaping work.
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >
> >
>
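
A rough sketch of how the data Qu asks for above might be collected with
stock btrfs-progs. The device (/dev/dm-3), mount point (/mnt), and the
<vaddr>/<parent-bytenr> arguments are placeholders for the looping filesystem,
and option spellings may differ between btrfs-progs versions (numeric tree
ids 2 and 1 can be substituted for "extent" and "root" if the names are not
accepted), so treat this as a starting point rather than a recipe.

The vaddr of the looping block group is in the kernel messages; its contents
can be listed with the python-btrfs example already used earlier in the thread:

  # share/python-btrfs/examples/show_block_group_contents.py <vaddr> /mnt

Extent tree and root tree dumps (the root tree shows which reloc trees still
exist and how many refs they hold); the extent tree dump can be very large,
and the interesting part is the range covering the looping block group:

  # btrfs inspect-internal dump-tree -t extent /dev/dm-3 > extent-tree.txt
  # btrfs inspect-internal dump-tree -t root /dev/dm-3 > root-tree.txt

To follow a shared backref step by step as Qu describes, dump the referenced
tree block at its parent bytenr, find that block's own backref in the extent
tree dump, and repeat until a tree root is reached:

  # btrfs inspect-internal dump-tree -b <parent-bytenr> /dev/dm-3

Output taken from a mounted, active filesystem can be inconsistent, so it is
best captured while the balance is looping but the filesystem is otherwise
idle.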
