Re: Scrub racing with btrfs workers on umount

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Nov 20, 2012 at 06:34:13AM -0700, Jan Schmidt wrote:
> I found a race condition in the way scrub interacts with btrfs workers:
> 
> Please carefully distinguish "workers" as struct btrfs_workers from "worker" as
> struct btrfs_worker_thread.
> 
> Process A: umount
> 	Process B: scrub-workers
> 		Process C: genwork-worker
> 
> A: close_ctree()
> A: scrub_cancel()
> 	B: scrub_workers_put()
> 	B: btrfs_stop_workers()
> 	B: -> lock(workers->lock)
> 	B: -> splice idle worker list into worker list
> 	B: -> while (worker list)
> 	B: ->	worker = delete from worker list
> 	B: ->	unlock(workers->lock)
> 	B: ->	kthread_stop(worker)
> 		C: __btrfs_start_workers()
> 		C: -> kthread_run starts new scrub-worker
> 		C: -> lock(workers->lock) /* from scrub-workers */
> 		C: -> add to workers idle list
> 		C: -> unlock(workers->lock)
> 	B: ->	lock(workers->lock)
> A: stop genwork-worker
> 
> In that situation, the scrub-workers have one idle worker but return from
> btrfs_stop_workers(). There are several strategies to fix this:
> 
> 1. Let close_ctree call scrub_cancel() after stopping the generic worker.
> 
> 2. Let btrfs_stop_workers be more careful on insertions to the idle list, like:
> 
> diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
> index 58b7d14..2523781 100644
> --- a/fs/btrfs/async-thread.c
> +++ b/fs/btrfs/async-thread.c
> @@ -431,6 +431,8 @@ void btrfs_stop_workers(struct btrfs_workers *workers)
>                         kthread_stop(worker->task);
>                 spin_lock_irq(&workers->lock);
>                 put_worker(worker);
> +               /* we dropped the lock above, check for new idle workers */
> +               list_splice_init(&workers->idle_list, &workers->worker_list);
>         }
>         spin_unlock_irq(&workers->lock);
>  }
> 
> 3. Find out why __btrfs_start_workers() gets called in the above scenario at all.
> 
> Any thoughts? Otherwise, I'm going for option 3 and will call back.

I'd go for a mixture of 1 and 3.  My first guess is that we're getting
endio events for scrub reads and that is kicking the scrub state
machine.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Filesystem Development]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux