On 4/30/20 19:47, Zygo Blaxell wrote:
>
> If it keeps repeating "found 1115 extents" over and over (say 5 or
> more times) then you're hitting the balance looping bug in kernel 5.1
> and later. Every N block groups (N seems to vary by user, I've heard
> reports from 3 to over 6000) the kernel will get stuck in a loop and
> will need to reboot to recover. Even if you cancel the balance, it will
> just loop again until rebooted, and there's no cancel for device delete
> so if you start looping there you can just skip directly to the reboot.
> For a non-trivial filesystem the probability of successfully deleting
> or resizing a device is more or less zero.

This does not seem to be happening. Each message is for a different
block group with a different number of clusters. The device remove *is*
making progress, just very very slowly. I'm almost down to just 2 TB
left. Woot!

If I ever have to do this again, I'll insert bcache and a big SSD
between btrfs and my devices. The slowness here has to be due to the
(spinning) disk I/O being highly fragmented and random. I've checked,
and none of my drives (despite their large sizes) are shingled, so
that's not it. The 6 TB units have 128 MB caches and the 16 TB have
256 MB caches.

I've never understood *exactly* what a hard drive internal cache does. I
see little sense in a LRU cache just like the host's own buffer cache
since the host has far more RAM. I do know they're used to reorder
operations to reduce seek latency, though this can be limited by the
need to fence writes to protect against a crash. I've wondered if
they're also used on reads to reduce rotational latency by prospectively
grabbing data as soon as the heads land on a cylinder. How big is a
"cylinder" anyway?

The inner workings of hard drives have become steadily more opaque over
the years, which makes it difficult to optimize their use. Kinda like
CPUs, actually. Last time I really tuned up some tight code, I found
that using vector instructions and avoiding branch mispredictions made a
big difference but nothing else seemed to matter at all.

>
> There is no fix for that regression yet. Kernel 4.19 doesn't have the
> regression and does have other relevant bug fixes for balance, so it
> can be used as a workaround.

I'm running 4.19.0-8-rt-amd64, the current real-time kernel in Debian
'stable'.

Phil
