On Mon, Aug 12, 2019 at 7:48 PM Omar Sandoval <osandov@xxxxxxxxxxx> wrote:
>
> On Mon, Aug 12, 2019 at 12:38:55PM +0100, Filipe Manana wrote:
> > On Tue, Aug 6, 2019 at 6:48 PM Omar Sandoval <osandov@xxxxxxxxxxx> wrote:
> > >
> > > From: Omar Sandoval <osandov@xxxxxx>
> > >
> > > We hit the following very strange deadlock on a system with Btrfs on a
> > > loop device backed by another Btrfs filesystem:
> > >
> > > 1. The top (loop device) filesystem queues an async_cow work item from
> > >    cow_file_range_async(). We'll call this work X.
> > > 2. Worker thread A starts work X (normal_work_helper()).
> > > 3. Worker thread A executes the ordered work for the top filesystem
> > >    (run_ordered_work()).
> > > 4. Worker thread A finishes the ordered work for work X and frees X
> > >    (work->ordered_free()).
> > > 5. Worker thread A executes another ordered work and gets blocked on I/O
> > >    to the bottom filesystem (still in run_ordered_work()).
> > > 6. Meanwhile, the bottom filesystem allocates and queues an async_cow
> > >    work item which happens to be the recently-freed X.
> > > 7. The workqueue code sees that X is already being executed by worker
> > >    thread A, so it schedules X to be executed _after_ worker thread A
> > >    finishes (see the find_worker_executing_work() call in
> > >    process_one_work()).
> > >
> > > Now, the top filesystem is waiting for I/O on the bottom filesystem, but
> > > the bottom filesystem is waiting for the top filesystem to finish, so we
> > > deadlock.
> > >
> > > This happens because we are breaking the workqueue assumption that a
> > > work item cannot be recycled while it still depends on other work. Fix
> > > it by waiting to free the work item until we are done with all of the
> > > related ordered work.
> > >
> > > P.S.:
> > >
> > > One might ask why the workqueue code doesn't try to detect a recycled
> > > work item.
> > > It actually does try by checking whether the work item has
> > > the same work function (find_worker_executing_work()), but in our case
> > > the function is the same. This is the only key that the workqueue code
> > > has available to compare, short of adding an additional, layer-violating
> > > "custom key". Considering that we're the only ones that have ever hit
> > > this, we should just play by the rules.
> > >
> > > Unfortunately, we haven't been able to create a minimal reproducer other
> > > than our full container setup using a compress-force=zstd filesystem on
> > > top of another compress-force=zstd filesystem.
> > >
> > > Suggested-by: Tejun Heo <tj@xxxxxxxxxx>
> > > Signed-off-by: Omar Sandoval <osandov@xxxxxx>
> >
> > Reviewed-by: Filipe Manana <fdmanana@xxxxxxxx>
> >
> > Looks good to me, thanks.
> > Another variant of the problem Liu fixed back in 2014 (commit
> > 9e0af23764344f7f1b68e4eefbe7dc865018b63d).
>
> Good point. I think we can actually get rid of those unique helpers with
> this fix. I'll send some followup cleanups.

Great! Thanks.

-- 
Filipe David Manana,

"Whether you think you can, or you think you can't -- you're right."
