On 11.09.19 г. 9:45 ч., David Newall wrote: > Hi All, > > I might have misunderstood how to report a problem. I registered for > bugzilla and reported a bug > (https://bugzilla.kernel.org/show_bug.cgi?id=204757), but, perhaps I > should have sent this message to this mailing list, first. My apologies > if I bungled it. > > I've been trying to track down a problem, intermittently, for a long > time, and now need to reach out for advice. I apologise in advance for > the quality of this report, which I feel includes more detail than > needed, yet may be missing what's important. I'm trying my best. > > The brief summary is that my system hangs during SSH login while a > backup is in progress. Sshd uses PAM authentication. The problem seems > to be related to mounts as df and mount also hang. > > The longer details are: I'm running Ubuntu 16.04.5 on a 64-bit VM under > kvm. I backup data using the following steps: > > 1. Take an LVM2 snapshot of the (non-root) ext2 file-system mounted as > /data; > 2. Mount a btrfs file system as /backup; > 2. Mount the snapshot over an empty directory (may be subvolume; does it > make a difference?) on /backup/snapshot; > 3. Rsync the snapshot (with --archive --one-file-system --hard-links > --inplace --numeric-ids --delete) to a subvolume /backup/data (thus it > always contains /data as at last backup); > 4. Take btrfs subvolume snapshot of /backup/data; > 5. Unmount /backup/snapshot and /backup. > > By the time I get called, SSH logins via PAM hang (but complete > "immediately" if I re-configure sshd for UsePAM no). Sessions which are > still logged in seem unaffected, except df and mount both hang. I don't > know what else hangs. > > During all of these steps, the /data is almost static, maybe even be > completely static. > > I've queried my user, carefully, to determine the exact step where it > starts to hang, and am 90% confident in her answer, which indicates that > the hang-condition starts during rsync. > > Processes that were hanging complete normally when subvolume snapshot > finishes. > > There's a chance that processes complete when the snapshot or btrfs > file-system is unmounted, but I think it's before then because I've > tried running each step by hand, was unable to reproduce the problem, > probably because the amount of data to rsync in real-use is much larger > than I tried writing during that test. At any rate, during that test I > could log in between and during each step of the procedure. > > The only messages in dmesg are "mounting ext2 file system using the ext4 > subsystem" and "mounted filesystem without journal. Opts: (null)", which > sounds right as I use "mount" instead of "mount -text2". > > When I tried running df under strace, strace's output was: > > open("/proc/self/mountinfo", O_RDONLY) = 3 > fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 > read(3, "18 24 0:17 / /sys rw,nosuid,node"..., 1024) = 1024 > read(3, "ystemd/systemd-cgroups-agent,nam"..., 1024) = 1024 > read(3, "t rw,nosuid,nodev,noexec,relatim"..., 1024) = 1024 > read(3, > > After the subvolume snapshot completed, strace continued producing output: > > "fs lxcfs rw,user_id=0,group_id=0"..., 1024) = 624 > --- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=28055, > si_uid=1000} --- > read(3, "", 1024) = 0 > lseek(3, 0, SEEK_CUR) = 3696 > close(3) = 0 > > I think the SIGCONT was because I suspended the parent, strace, using > Ctrl-Z. > > I could just leave sshd doing non-PAM authentication but I think that's > the wrong approach. How do I zero in on this problem? When the issue manifests do : echo w > /proc/sysrq-trigger This should provide a backtrace for all threads which are currently in uninterruptible sleep. If it's a deadlock due to btrfs being stuck we should see it. Also provide your exact kernel version. > > Thanks, > > David > >
