On Thu, Apr 30, 2020 at 11:31 AM Phil Karn <karn@xxxxxxxx> wrote:
>
> Any comments on my message about btrfs drive removals being extremely
> slow?

It could be any number of things. Each drive has at least 3 partitions,
so what else is on these drives? Are those other partitions active with
other things going on at the same time? How are the drives connected to
the computer? Direct SATA/SAS connection? Via USB enclosures? How many
snapshots? Are quotas enabled?

There's nothing in dmesg for 5 days? Anything for the most recent hour?
i.e. journalctl -k --since=-1h

It's an old kernel by this list's standards. Mostly this list is active
development on mainline and stable kernels, not LTS kernels, so you
might have found a bug. But there are thousands of changes throughout
the storage stack in the kernel since then, thousands just in Btrfs
between 4.19 and 5.7, with 5.8 being worked on now. It's a 20+ month
development difference. It's pretty much just luck if an upstream Btrfs
developer sees this and happens to know why it's slow and that it was
fixed in kernel version X; or maybe it's a really old bug that just
hasn't yet gotten a good enough bug report and hasn't been fixed.
That's why the common advice is to "try with a newer kernel": the
problem might not happen, and if it does, then chances are it's a bug.

> I started the operation 5 days ago, and as of right now I still have
> 2.18 TB to move off the drive I'm trying to replace. I think it
> started around 3.5 TB.

Issue sysrq+t and post the output from 'journalctl -k --since=-10m' in
something like pastebin, or in a text file on Nextcloud/Dropbox etc.
It's probably too big to email, and the formatting usually gets munged
anyway and is hard to read. Someone might have an idea why it's slow
from the sysrq+t output, but it's a long shot.

> Should I reboot degraded without this drive and do a "remove missing"
> operation instead?
> I'm willing to take the risk of losing another drive during the
> operation if it'll speed this up. It wouldn't be so bad if it weren't
> slowing my filesystem to a crawl for normal stuff, like reading mail.

If there's anything important on this file system, you should make a
copy now. Update backups. You should be prepared to lose the whole
thing before proceeding further.

Next, disable the write cache on all the drives. This can be done with
hdparm -W (capital W; lowercase -w is dangerous, see the man page).
This should improve the chance of the file system on all drives being
consistent if you have to force a reboot. That is, the reboot might
hang, so you should be prepared to issue sysrq+s followed by sysrq+b.
That's better than a power reset.

We don't know what we don't know, so the next step is a guess. While
powered off, you can remove devid 2, the device you want removed. First
see if you can mount -o ro,degraded, check dmesg, and see if things
pass a basic sanity test for reading. Then remount rw and try to remove
the missing device. It might go faster to just rebuild the missing data
from the single copies left, but there's not much to go on.

Or: boot, leave all drives connected, make sure the write caches are
disabled, then make sure there's no SCT ERC mismatch, i.e.
https://raid.wiki.kernel.org/index.php/Timeout_Mismatch

And then do a scrub with all the drives attached, and assess the next
step only after that completes. It'll either fix something or not. You
can do this same thing with kernel 4.19; it should work. But until the
health of the file system is known, I can't recommend doing any device
replacements or removals. It must be completely healthy first.

I personally would only do the device removal (either remove while
still connected, or remove while missing) with 5.6.8 or 5.7rc3,
because if I have a problem, I'm reporting it on this list as a bug.
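The write-cache, timeout-mismatch, and scrub steps above could be
scripted roughly like this. It's a dry-run sketch: the device names
(/dev/sda, /dev/sdb, /dev/sdc) and the mount point (/mnt/pool) are
placeholders for your actual pool members, and RUN="echo" just prints
each command so nothing is touched; clear it to execute for real, as
root.

```shell
#!/bin/sh
RUN="echo"                            # dry run: print commands; set RUN="" to execute
DEVICES="/dev/sda /dev/sdb /dev/sdc"  # hypothetical pool members -- substitute yours
MNT="/mnt/pool"                       # hypothetical mount point

prep_and_scrub() {
    for dev in $DEVICES; do
        # Capital -W controls the write cache; -W0 disables it.
        # (Lowercase -w is a different, dangerous option -- see hdparm(8).)
        $RUN hdparm -W0 "$dev"
        # Report the drive's SCT error recovery timeout so it can be
        # compared against the kernel's SCSI command timer (default 30 s).
        $RUN smartctl -l scterc "$dev"
    done
    # Scrub with all drives attached, then assess.
    $RUN btrfs scrub start "$MNT"
}
prep_and_scrub
```

If smartctl shows ERC disabled on a drive that supports it, setting it
below the kernel's command timer (e.g. smartctl -l scterc,70,70 for
7 seconds) avoids the mismatch described at the wiki link above.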
With 4.19 it's just too old for this list, I think; it's pure luck if
anyone knows for sure what's going on.

--
Chris Murphy
