On 3/26/20 11:36 AM, Zygo Blaxell wrote:
On Wed, Mar 25, 2020 at 10:12:30AM -0400, Josef Bacik wrote:
On 3/25/20 9:51 AM, David Sterba wrote:
On Fri, Mar 13, 2020 at 05:12:15PM -0400, Josef Bacik wrote:
While debugging Zygo's delayed ref problems it was clear there were a bunch of
cases that we're running delayed refs when we don't need to be, and they result
in a lot of weird latencies.
Each patch has their individual explanations. But the gist of it is we run
delayed refs in a lot of arbitrary ways that have just accumulated throughout
the years, so clean up all of these so we can have more consistent performance.
It would be fine to remove the delayed refs being run from so many
places but I vaguely remember some patches adding them with "we have to
run delayed refs here or we will miss something and that would be a
corruption". The changelogs in patches from 3 on don't point out any
specific problems and I miss some reasoning about correctness, ideally
for each line of btrfs_run_delayed_refs removed.
As a worst case I really don't want to get to a situation where we start
getting reports that something broke because of the missing delayed
refs, followed by series of "oh yeah I forgot we need it here, add it
back".
Yeah I went through and checked each of these spots to see why we had them.
A lot of it had to do with how poorly delayed refs were run previously. You
could end up with weird ordering cases and missing our flags.
These problems are all gone now, we no longer have to run delayed refs to
work around ordering weirdness because I fixed all of those problems. Now
these are just old relics of the past that need to die. The only case where
I didn't touch them is for qgroups, likely because it still matters for the
before/after lookups there.
But everywhere else it was working around some deficiency in how we ran
delayed refs, either in the ordering issues or space related. Both those
problems no longer exist, so we can drop these workarounds.
The branch with this patchset is in for-next but I'm still not
comfortable with adding it to misc-next as I can't convince myself it's
safe, so more reviews are welcome.
Yeah I'm targeting the merge window after the upcoming one with these,
there's still a lot more testing I want to get done. I mostly threw them up
because they were no longer blowing up constantly for Zygo, and I wanted
Filipe to get an early look at them. Thanks,
No longer blowing up _constantly_, but there was definitely a 2-3 day
cadence between blowups last time I rebased. Test runs were ending in
splats due to KASAN UAF bugs and bad unlock balances. It doesn't seem
to be corrupting on-disk metadata, but my test VMs can't get anywhere
close to a week uptime under the full stress load yet.
I'd like to keep a test VM pointed at this as it makes it way upstream.
It's an important set of changes, but it has a high regression risk.
There are some big changes here, and that's going to expose all the gaps
in developers' knowledge of how stuff really works.
Do I just keep rebasing on for-next-<date>?
I think so, I'm not sure what Dave has merged so far. My own long term tests
have uncovered a few bugs that I've been busy running down. Once my long term
tests no longer fall over I'm going to rebase everything onto the most recent
devel branch and we can go from there. Hopefully I'll have this last corner run
down by tomorrow or early next week. Thanks,
Josef