On Sun, Feb 09, 2020 at 10:00:34AM +0100, Martin Steigerwald wrote:
> Zygo Blaxell - 09.02.20, 01:43:07 CET:
> > Up to that point, a few processes have been blocked for up to 5 hours,
> > but this is not unusual on a big filesystem given #1. Usually
> > processes that read the filesystem (e.g. calling lstat) are not
> > blocked, unless they try to access a directory being modified by a
> > process that is blocked. lstat() being blocked is unusual.
>
> This is really funny, cause what you consider not being unusual, I'd
> consider a bug or at least a huge limitation.
>
> But in a sense I never really got that processes can be stuck in
> uninterruptible sleep on Linux or Unix *at all*. Such a situation
> without giving a user at least the ability to end it by saying "I don't
> care about the data that process is to write, let me remove it already"
> for me is a major limitation to what appears to be kind of specific to
> the UNIX architecture or at least the way the Linux virtual memory
> manager is working.
> That written I may be completely ignorant of something very important
> here and some may tell me it can't be any other way for this and that
> reason. Currently I still think it can.

The process in uninterruptible sleep is waiting for the filesystem code
to finish whatever it's doing so the in-memory and on-disk structures
end in a consistent state. If whatever it's doing is "waiting for a
lock held by some other thread doing an expensive thing", it can block
for a long time.

We can't simply abort the kernel thread here, which is why it's
uninterruptible wait (*). Generic interruption would need to unwind the
kernel stack all the way back to userspace, reverting all changes made
to the filesystem's internal data structures as we go, without tripping
over the need for some other lock in the process, and without
introducing horrible new regressions.

In theory we can interrupt any kernel thread at any time--that happens
naturally whenever there's a BUG() or power failure, for instance--but
the effect on all the other threads that might be running is pretty
painful.

If you add a level of indirection--e.g. run the btrfs code in a VM and
access it via a network or virtio client--then we can interrupt the
client, but the server ends up having to finish whatever operation the
client requested anyway, so the client just gets to immediately hang
waiting for the server on its next call.

> And even if uninterruptible sleep can still happen cause it is really
> necessary, five hours is at least about five hours minus probably a minute
> or so too long.

Yes it would be nice if btrfs could avoid overcommitting itself so badly,
but that's a somewhat older and larger-scoped bug.

> Ciao,
> --
> Martin

(*) well we could, if all the filesystem code was written that way.
Patches welcome!
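
For anyone who wants to see which processes are currently stuck in the
uninterruptible sleep discussed above: on Linux it shows up as state "D"
in the third field of /proc/<pid>/stat. What follows is only a rough
sketch, not anything btrfs-specific. The function names (parse_state,
uninterruptible_tasks) are made up for illustration, and it looks only
at process entries directly under /proc, not at the individual threads
under /proc/<pid>/task.

```python
import os


def parse_state(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    head, _, rest = stat_line.partition("(")
    comm, _, tail = rest.rpartition(")")
    return int(head), comm, tail.split()[0]


def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, comm) for processes currently in state D."""
    found = []
    if not os.path.isdir(proc_root):
        return found  # not Linux, or procfs is not mounted
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_state(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited (or was unreadable) while we looked
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

This is a point-in-time snapshot, so a process that shows up once may
just be doing ordinary I/O. The ones worth worrying about are those that
stay in D across repeated runs.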
