On 11/5/20 2:46 pm, Chris Murphy wrote:
> I also wonder whether the socket that Graham mentions, could get in
> some kind of stuck or confused state due to sleep/wake cycle? My case,
> NVMe, is maybe not the best example because that's just PCIe. In your
> case it's real drives, so it's SCSI, block, and maybe libata and other
> things.
Could be. When I start a new scrub and then suspend the system, any
attempt after the resume to run "btrfs scrub status -dR /home" results
in the following:
NOTE: Reading progress from status file
UUID: 85069ce9-be06-4c92-b8c1-8a0f685e43c6
scrub device /dev/sda (id 1) status
Scrub started: Wed May 13 16:10:12 2020
Status: running
Duration: 0:00:22
data_extents_scrubbed: 0
tree_extents_scrubbed: 29238
data_bytes_scrubbed: 0
tree_bytes_scrubbed: 479035392
read_errors: 0
csum_errors: 0
verify_errors: 0
no_csum: 0
csum_discards: 0
super_errors: 0
malloc_errors: 0
uncorrectable_errors: 0
unverified_errors: 0
corrected_errors: 0
last_physical: 0
scrub device /dev/sdb (id 2) status
Scrub started: Wed May 13 16:10:12 2020
Status: running
Duration: 0:00:23
data_extents_scrubbed: 0
tree_extents_scrubbed: 27936
data_bytes_scrubbed: 0
tree_bytes_scrubbed: 457703424
read_errors: 0
csum_errors: 0
verify_errors: 0
no_csum: 0
csum_discards: 0
super_errors: 0
malloc_errors: 0
uncorrectable_errors: 0
unverified_errors: 0
corrected_errors: 0
last_physical: 0
So it appears that the socket connection to the kernel cannot be
reestablished after the resume from suspend-to-RAM. Interestingly, if I
then manually run "btrfs scrub cancel /home" and "btrfs scrub resume -c3
/home", the status reports start working again. It fails only when
"btrfs scrub resume -c3 /home" is run from the script
"/usr/lib/systemd/system-sleep/btrfs-scrub", which is as follows:
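#!/bin/sh
# systemd-sleep hook: $1 is "pre" or "post", $2 is the sleep action
case "$1/$2" in
    pre/*)
        # cancel the running scrub before suspending
        btrfs scrub cancel /home
        ;;
    post/*)
        # resume the scrub (idle I/O priority class) after waking up
        sleep 1
        btrfs scrub resume -c3 /home
        ;;
esac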
Without the sleep, it does not resume the scrub. A longer sleep (5
seconds) does not resolve the issue with the status reports.
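To see what the hook's btrfs calls actually return, a variant along
these lines might help. It is only a sketch: the run_logged helper and
the "btrfs-scrub-hook" log tag are names made up for illustration, and
it assumes logger(1) is available in the hook's environment.

```shell
#!/bin/sh
# Sketch of the same systemd-sleep hook with exit-status logging.
# run_logged and the "btrfs-scrub-hook" tag are illustrative names,
# not part of btrfs-progs or systemd.

# Run a btrfs subcommand, then record its exit status via logger(1).
run_logged() {
    btrfs "$@"
    rc=$?
    logger -t btrfs-scrub-hook "btrfs $* exited with status $rc"
    return "$rc"
}

case "$1/$2" in
    pre/*)
        run_logged scrub cancel /home
        ;;
    post/*)
        sleep 1
        run_logged scrub resume -c3 /home
        ;;
esac
```

After a suspend/resume cycle, "journalctl -t btrfs-scrub-hook" should
then show whether the resume reported success when run from the hook.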
Maybe this is some kind of systemd problem... :(
Thanks,
Andrew