On Wed, May 13, 2020 at 12:49 AM Andrew Pam <andrew@xxxxxxxxxxxxxx> wrote:
>
> On 11/5/20 2:46 pm, Chris Murphy wrote:
> > I also wonder whether the socket that Graham mentions, could get in
> > some kind of stuck or confused state due to sleep/wake cycle? My case,
> > NVMe, is maybe not the best example because that's just PCIe. In your
> > case it's real drives, so it's SCSI, block, and maybe libata and other
> > things.
>
> Could be. When I start a new scrub and suspend the system, after a
> resume further attempts to run "btrfs scrub status -dR /home" result in
> the following:
>
> NOTE: Reading progress from status file
> UUID: 85069ce9-be06-4c92-b8c1-8a0f685e43c6
> scrub device /dev/sda (id 1) status
> Scrub started: Wed May 13 16:10:12 2020
> Status: running
> Duration: 0:00:22
>         data_extents_scrubbed: 0
>         tree_extents_scrubbed: 29238
>         data_bytes_scrubbed: 0
>         tree_bytes_scrubbed: 479035392
>         read_errors: 0
>         csum_errors: 0
>         verify_errors: 0
>         no_csum: 0
>         csum_discards: 0
>         super_errors: 0
>         malloc_errors: 0
>         uncorrectable_errors: 0
>         unverified_errors: 0
>         corrected_errors: 0
>         last_physical: 0
> scrub device /dev/sdb (id 2) status
> Scrub started: Wed May 13 16:10:12 2020
> Status: running
> Duration: 0:00:23
>         data_extents_scrubbed: 0
>         tree_extents_scrubbed: 27936
>         data_bytes_scrubbed: 0
>         tree_bytes_scrubbed: 457703424
>         read_errors: 0
>         csum_errors: 0
>         verify_errors: 0
>         no_csum: 0
>         csum_discards: 0
>         super_errors: 0
>         malloc_errors: 0
>         uncorrectable_errors: 0
>         unverified_errors: 0
>         corrected_errors: 0
>         last_physical: 0
>
> So it appears that the socket connection to the kernel is not able to be
> reestablished after the resume from suspend-to-RAM. Interestingly, if I
> then manually run "btrfs scrub cancel /home" and "btrfs scrub resume -c3
> /home" then the status reports start working again. It fails only when
> "btrfs scrub resume -c3 /home" is run from the script
> "/usr/lib/systemd/system-sleep/btrfs-scrub" as follows:
>
> #!/bin/sh
>
> case $1/$2 in
>   pre/*)
>     btrfs scrub cancel /home
>     ;;
>   post/*)
>     sleep 1
>     btrfs scrub resume -c3 /home
>     ;;
> esac
>
> Without the sleep, it does not resume the scrub. A longer sleep (5
> seconds) does not resolve the issue with the status reports.
>
> Maybe this is some kind of systemd problem... :(

Oof. So possibly two bugs.

Well, you could post an inquiry about it to systemd-devel@. And then
maybe try to reproduce with something that has systemd-245 to see if
it makes a difference. I ran into something that looked like races in
244 with a Rust generator I was testing, that didn't happen in 245.
*shrug*

But I'm not sure, testing-wise, how to isolate the systemd question
from the socket question.

-- 
Chris Murphy
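
P.S. One rough idea for gathering more data, untested on my side: a
variant of your hook that records what each btrfs command printed and
its exit status, so the log from a hook-driven resume can be compared
with a manual "btrfs scrub resume -c3 /home". This is only a sketch.
The BTRFS, MNT and LOG variables, the log path, and the log_run and
scrub_hook helpers are names I made up for illustration; the btrfs
invocations themselves are the ones from your script.

```shell
#!/bin/sh
# Hypothetical debugging variant of the system-sleep hook quoted above.
# BTRFS, MNT and LOG are illustrative knobs, not part of the original.
BTRFS=${BTRFS:-btrfs}
MNT=${MNT:-/home}
LOG=${LOG:-/var/log/btrfs-scrub-hook.log}

log_run() {
    # Run the given command, appending its output and exit status to $LOG.
    "$@" >>"$LOG" 2>&1
    rc=$?
    printf '%s rc=%s: %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$rc" "$*" >>"$LOG"
    return "$rc"
}

scrub_hook() {
    # $1 is "pre" or "post", $2 is the sleep action (e.g. "suspend").
    case $1/$2 in
        pre/*)
            log_run "$BTRFS" scrub cancel "$MNT"
            ;;
        post/*)
            sleep 1
            log_run "$BTRFS" scrub resume -c3 "$MNT"
            ;;
    esac
}

scrub_hook "${1-}" "${2-}"
```

If the logged exit status and output match between the hook and a
manual run, that would at least hint that the difference lies in the
environment the hook runs in rather than in the command itself.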
