On Thu, Nov 01, 2018 at 04:08:32PM +0100, David Sterba wrote:
> On Thu, Nov 01, 2018 at 01:31:18PM +0000, Chris Mason wrote:
> > On 1 Nov 2018, at 6:15, David Sterba wrote:
> >
> > > On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote:
> > >> From: Omar Sandoval <osandov@xxxxxx>
> > >>
> > >> There's a race between close_ctree() and cleaner_kthread().
> > >> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> > >> sees it set, but this is racy; the cleaner might have already checked
> > >> the bit and could be cleaning stuff. In particular, if it deletes
> > >> unused
> > >> block groups, it will create delayed iputs for the free space cache
> > >> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> > >> longer running delayed iputs after a commit. Therefore, if the
> > >> cleaner
> > >> creates more delayed iputs after delayed iputs are run in
> > >> btrfs_commit_super(), we will leak inodes on unmount and get a busy
> > >> inode crash from the VFS.
> > >>
> > >> Fix it by parking the cleaner
> > >
> > > Ouch, that's IMO the wrong way to fix it. The bug is on a higher level:
> > > we're missing a commit or a cleanup of data structures. Messing with the
> > > state of a thread would be the last thing I'd try, and only after proving
> > > that it's not possible to fix this in the logic of btrfs itself.
> > >
> > > The shutdown sequence in close_ctree is quite tricky and we've had bugs
> > > there. The interdependencies of the threads, data structures and other
> > > subsystems must not have loops, otherwise we could not find an ordering
> > > that does not leak something.
> > >
> > > It's not a big problem if some step is done more than once, like
> > > committing or cleaning up some other structures, if we know that an
> > > earlier step could create new work.
> >
> > The problem is the cleaner thread needs to be told to stop doing new
> > work, and we need to wait for the work it's already doing to be
> > finished. We're getting "stop doing new work" already because the
> > cleaner thread checks to see if the FS is closing, but we don't have a
> > way today to wait for it to finish what it's already doing.
> >
> > kthread_park() is basically the same as adding another mutex or
> > synchronization point. I'm not sure how we could change close_ctree() or
> > the final commit to pick this up more effectively?
>
> The idea is:
>
> cleaner                          close_ctree thread
>
>                                  tell cleaner to stop
>                                  wait
> start work
> if should_stop, then exit
>                                  cleaner is stopped
>
> [does not run: finish work]
> [does not run: loop]
>                                  pick up the work or make
>                                  sure there's nothing in
>                                  progress anymore
>
>
> A simplified version in code:
>
> set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
>
> wait for defrag - could be started from cleaner but next iteration will
> see the fs closed and will not continue
>
> kthread_stop(transaction_kthread)
>
> kthread_stop(cleaner_kthread)
>
> /* next, everything that could be left from cleaner should be finished */
>
> btrfs_delete_unused_bgs();
> assert there are no defrags
> assert there are no delayed iputs
> commit if necessary
>
> IOW, the unused block groups are removed in close_ctree too early; moving
> that after the threads stop AND making sure that it does not need either
> of them should work.
>
> The "AND" above is not currently implemented, as btrfs_delete_unused_bgs
> calls plain btrfs_end_transaction, which wakes up the transaction kthread,
> so there would need to be an argument passed to tell it to do a full
> commit.
Not perfect: it relies on the fact that wake_up_process(thread) on a stopped
thread is a no-op, and arguments would need to be added to skip that in
btrfs_delete_unused_bgs and btrfs_commit_super.
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3956,11 +3956,18 @@ void close_ctree(struct btrfs_fs_info *fs_info)
cancel_work_sync(&fs_info->async_reclaim_work);
+ if (test_bit(BTRFS_FS_STATE_ERROR, &fs_info->fs_state) ||
+ test_bit(BTRFS_FS_STATE_TRANS_ABORTED, &fs_info->fs_state))
+ btrfs_error_commit_super(fs_info);
+
+ kthread_stop(fs_info->transaction_kthread);
+ kthread_stop(fs_info->cleaner_kthread);
+
if (!sb_rdonly(fs_info->sb)) {
/*
- * If the cleaner thread is stopped and there are
- * block groups queued for removal, the deletion will be
- * skipped when we quit the cleaner thread.
+ * The cleaner thread is now stopped and if there are block
+ * groups queued for removal, we have to pick up the work here
+ * so there are no delayed iputs triggered.
*/
btrfs_delete_unused_bgs(fs_info);
@@ -3969,13 +3976,6 @@ void close_ctree(struct btrfs_fs_info *fs_info)
btrfs_err(fs_info, "commit super ret %d", ret);
}
- if (test_bit(BTRFS_FS_STATE_ERROR, &fs_info->fs_state) ||
- test_bit(BTRFS_FS_STATE_TRANS_ABORTED, &fs_info->fs_state))
- btrfs_error_commit_super(fs_info);
-
- kthread_stop(fs_info->transaction_kthread);
- kthread_stop(fs_info->cleaner_kthread);
-
ASSERT(list_empty(&fs_info->delayed_iputs));
set_bit(BTRFS_FS_CLOSING_DONE, &fs_info->flags);