I obviously can't be sure (due to the obscure nature of this issue), but I
think I have observed similar behavior. For me it usually kicked in during
scheduled defragmentation runs. I initially suspected it had something to
do with running defrag on files that were still open for appending writes
(I was specifying the entire root subvolume folder recursively), but it
kept happening after I excluded those folders. I now think defrag has
nothing to do with this beyond generating a lot of IO.

The SMART status is fine on all disks in the multi-device filesystem.

When this happens, the write buffer in system RAM is nearly full and a
manual sync hangs forever, but some read operations still succeed. A
normal reboot is not possible since sync won't finish (attempting one
usually locks the system up pretty well).

I haven't seen this happen since I updated to 4.19.3 (if I recall
correctly), although not much time has passed.

On Mon, Nov 26, 2018 at 8:14 PM Larkin Lowrey <llowrey@xxxxxxxxxxxxxxxxx> wrote:
>
> I started having a host freeze randomly when running a 4.18 kernel. The
> host was stable when running 4.17.12.
>
> At first, it appeared that it was only IO that was frozen since I could
> run common commands that were likely cached in RAM and that did not
> touch storage. Anything that did touch storage would freeze and I would
> not be able to ctrl-c it.
>
> I noticed today, when it happened with kernel 4.19.2, that backups were
> still running and that the backup app could still read from the backup
> snapshot subvol. It's possible that the backups are still able to
> proceed because the accesses are all read-only and the snapshot was
> mounted with noatime so the backup process never triggers a write.
>
> There never are any errors output to the console when this happens and
> nothing is logged. When I first encountered this back in Sept. I managed
> to record a few sysrq dumps and attached them to a redhat ticket. See
> links below.
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1627288
> https://bugzilla.redhat.com/attachment.cgi?id=1482177
>
> I do have several VMs running that have their image files nocow'd.
> Interestingly, all the VMs, except 1, seem to be able to write just
> fine. The one that can't has frozen completely and is the one that
> regularly generates the most IO.
>
> Any ideas on how to debug this further?
>
> --Larkin
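
P.S. In case it helps anyone compare notes on the "write buffer in RAM is full" symptom: below is a minimal sketch of how one could watch the dirty and writeback page counters while the hang develops. It assumes a Linux host where /proc/meminfo exposes the usual "Dirty" and "Writeback" fields; the function names parse_meminfo and writeback_pressure are made up for this illustration, and this is not necessarily how I checked it at the time.

```python
def parse_meminfo(text):
    """Map /proc/meminfo field names to their integer values (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def writeback_pressure(fields):
    """Pick out the counters for data waiting to be written or being written."""
    # Assumption: "Dirty" and "Writeback" are present; default to 0 if they are not.
    return {key: fields.get(key, 0) for key in ("Dirty", "Writeback")}


if __name__ == "__main__":
    # Reads only from procfs, so it should not block on the stuck filesystem.
    with open("/proc/meminfo") as f:
        print(writeback_pressure(parse_meminfo(f.read())))
```

If Dirty keeps climbing while Writeback stays stuck at a nonzero value, that would at least be consistent with what I described above, though it says nothing about the cause.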
