I have a btrfs filesystem which I want to scrub. This is a multi-TB filesystem and a full scrub will take well over 24 hours.

Unfortunately, the scrub turns out to be quite intrusive on the rest of the system (even when I make sure it runs at very low priority with both ionice and nice). Operations on other disks run excessively slowly, causing timeouts on important actions like mail delivery (which leads to bounces).

So, I break it up. I run the scrub for some interval (hours), with the time-critical services stopped. Then I cancel the scrub and let mail delivery run for a while. Then I stop mail again and resume the scrub for another interval, and so on.

This works and solves the mail bounce problem.

However, after a few cancel/resume cycles, the scrub terminates early. No errors are reported, but one of the resumes just terminates immediately, claiming the scrub is done. It isn't. Nowhere near.

The disk being scrubbed is in use during all this. It doesn't get a heavy load, but it is my main backup disk and various backups happen, some of them involving snapshots being created and deleted.

Glancing at the use of the ioctl in the btrfs-progs code, I assume the resume uses the last_physical value from the previous run as the starting point for the next. Does that break if the filesystem has changed in the meantime and that is no longer a used block, or something similar? If so, I think that makes resume useless.

If this is not expected behaviour, I will do more work to analyse and reproduce it.

Graham
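
P.S. In case it helps anyone trying to reproduce this, one cancel/resume cycle as described above can be sketched roughly as follows in Python. This is only a minimal sketch under assumptions: the function name scrub_cycle, the mount point and the service name are made up for illustration, and it assumes the time-critical service is managed with systemctl. It relies on "btrfs scrub start" and "btrfs scrub resume" running in the background by default, so the script sleeps for the interval and then issues "btrfs scrub cancel".

```python
import subprocess
import time


def scrub_cycle(mount_point, service, scrub_seconds, first=False,
                run=subprocess.run, sleep=time.sleep):
    """Run one scrub interval with a time-critical service stopped.

    All names here are hypothetical. `run` and `sleep` are injectable so
    the sequence of commands can be checked without touching a real system.
    Returns the exit status of `btrfs scrub cancel`.
    """
    # Stop the time-critical service (assumption: it is a systemd unit).
    run(["systemctl", "stop", service], check=True)
    try:
        # The first interval starts a fresh scrub; later ones resume it.
        verb = "start" if first else "resume"
        run(["btrfs", "scrub", verb, mount_point], check=True)

        # The scrub runs in the background; let it work for the interval.
        sleep(scrub_seconds)

        # Cancel so the scrub can be resumed later. A non-zero status is not
        # treated as fatal here, since the scrub may no longer be running.
        result = run(["btrfs", "scrub", "cancel", mount_point], check=False)
        return result.returncode
    finally:
        # Always bring the service back, even if a step above failed.
        run(["systemctl", "start", service], check=True)
```

For example, scrub_cycle("/mnt/backup", "postfix", 3 * 3600) would stop the (hypothetical) postfix unit, resume the scrub on /mnt/backup for three hours, cancel it, and start the unit again. Running "btrfs scrub status" on the mount point between cycles is one way to see whether a resume ended early.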
