On Thu, Nov 01, 2018 at 08:23:17AM -0700, Omar Sandoval wrote:
> On Thu, Nov 01, 2018 at 04:08:32PM +0100, David Sterba wrote:
> > On Thu, Nov 01, 2018 at 01:31:18PM +0000, Chris Mason wrote:
> > > On 1 Nov 2018, at 6:15, David Sterba wrote:
> > >
> > > > On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote:
> > > >> From: Omar Sandoval <osandov@xxxxxx>
> > > >>
> > > >> There's a race between close_ctree() and cleaner_kthread().
> > > >> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> > > >> sees it set, but this is racy; the cleaner might have already checked
> > > >> the bit and could be cleaning stuff. In particular, if it deletes
> > > >> unused
> > > >> block groups, it will create delayed iputs for the free space cache
> > > >> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> > > >> longer running delayed iputs after a commit. Therefore, if the
> > > >> cleaner
> > > >> creates more delayed iputs after delayed iputs are run in
> > > >> btrfs_commit_super(), we will leak inodes on unmount and get a busy
> > > >> inode crash from the VFS.
> > > >>
> > > >> Fix it by parking the cleaner
> > > >
> > > > Ouch, that's IMO wrong way to fix it. The bug is on a higher level,
> > > > we're missing a commit or clean up data structures. Messing with state
> > > > of a thread would be the last thing I'd try after proving that it's
> > > > not
> > > > possible to fix in the logic of btrfs itself.
> > > >
> > > > The shutdown sequence in close_tree is quite tricky and we've had bugs
> > > > there. The interdependencies of thread and data structures and other
> > > > subsystems cannot have loops that could not find an ordering that will
> > > > not leak something.
> > > >
> > > > It's not a big problem if some step is done more than once, like
> > > > committing or cleaning up some other structures if we know that
> > > > it could create new.
> > >
> > > The problem is the cleaner thread needs to be told to stop doing new
> > > work, and we need to wait for the work it's already doing to be
> > > finished. We're getting "stop doing new work" already because the
> > > cleaner thread checks to see if the FS is closing, but we don't have a
> > > way today to wait for him to finish what he's already doing.
> > >
> > > kthread_park() is basically the same as adding another mutex or
> > > synchronization point. I'm not sure how we could change close_tree() or
> > > the final commit to pick this up more effectively?
> >
> > The idea is:
> >
> > cleaner close_ctree thread
> >
> > tell cleaner to stop
> > wait
> > start work
> > if should_stop, then exit
> > cleaner is stopped
> >
> > [does not run: finish work]
> > [does not run: loop]
> > pick up the work or make
> > sure there's nothing in
> > progress anymore
> >
> >
> > A simplified version in code:
> >
> > set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
> >
> > wait for defrag - could be started from cleaner but next iteration will
> > see the fs closed and will not continue
> >
> > kthread_stop(transaction_kthread)
> >
> > kthread_stop(cleaner_kthread)
> >
> > /* next, everything that could be left from cleaner should be finished */
> >
> > btrfs_delete_unused_bgs();
> > assert there are no defrags
> > assert there are no delayed iputs
> > commit if necessary
> >
> > IOW the unused block groups are removed from close_ctree too early,
> > moving that after the threads stop AND makins sure that it does not need
> > either of them should work.
> >
> > The "AND" above is not currently implemented as btrfs_delete_unused_bgs
> > calls plain btrfs_end_transaction that wakes up transaction ktread, so
> > there would need to be an argument passed to tell it to do full commit.
>
> It's too fragile to run around in the filesystem with these threads
> freed. We can probably make it now, but I'm worried that we'll add
> another wakeup somewhere and blow up.
So functions that are called from close_ctree will be careful to either
not wake up or check before trying to wake them up. We can also wrap the
wakeups like
wake_up_cleaner(fs_info) {
if (!fs_info->state & CLOSING)
wake_up_process(cleaner_kthread)
else
BUG_ON
}
Same for kthread.