On Thu, Jan 23, 2020 at 03:33:02PM -0500, Josef Bacik wrote:
> There is a long-standing deadlock with range_cyclic writeback. If we
> loop around with a bio already built, we can deadlock with a writer
> who holds the lock on the page we're attempting to write but is waiting
> on a page in our bio to be written out. The task traces are as follows:
>
> PID: 1329874 TASK: ffff889ebcdf3800 CPU: 33 COMMAND: "kworker/u113:5"
> #0 [ffffc900297bb658] __schedule at ffffffff81a4c33f
> #1 [ffffc900297bb6e0] schedule at ffffffff81a4c6e3
> #2 [ffffc900297bb6f8] io_schedule at ffffffff81a4ca42
> #3 [ffffc900297bb708] __lock_page at ffffffff811f145b
> #4 [ffffc900297bb798] __process_pages_contig at ffffffff814bc502
> #5 [ffffc900297bb8c8] lock_delalloc_pages at ffffffff814bc684
> #6 [ffffc900297bb900] find_lock_delalloc_range at ffffffff814be9ff
> #7 [ffffc900297bb9a0] writepage_delalloc at ffffffff814bebd0
> #8 [ffffc900297bba18] __extent_writepage at ffffffff814bfbf2
> #9 [ffffc900297bba98] extent_write_cache_pages at ffffffff814bffbd
>
> PID: 2167901 TASK: ffff889dc6a59c00 CPU: 14 COMMAND: "aio-dio-invalid"
> #0 [ffffc9003b50bb18] __schedule at ffffffff81a4c33f
> #1 [ffffc9003b50bba0] schedule at ffffffff81a4c6e3
> #2 [ffffc9003b50bbb8] io_schedule at ffffffff81a4ca42
> #3 [ffffc9003b50bbc8] wait_on_page_bit at ffffffff811f24d6
> #4 [ffffc9003b50bc60] prepare_pages at ffffffff814b05a7
> #5 [ffffc9003b50bcd8] btrfs_buffered_write at ffffffff814b1359
> #6 [ffffc9003b50bdb0] btrfs_file_write_iter at ffffffff814b5933
> #7 [ffffc9003b50be38] new_sync_write at ffffffff8128f6a8
> #8 [ffffc9003b50bec8] vfs_write at ffffffff81292b9d
> #9 [ffffc9003b50bf00] ksys_pwrite64 at ffffffff81293032
>
> I used drgn to find the respective pages we were stuck on:
>
> page_entry.page 0xffffea00fbfc7500 index 8148 bit 15 pid 2167901
> page_entry.page 0xffffea00f9bb7400 index 7680 bit 0 pid 1329874
>
> As you can see, the kworker is waiting for bit 0 (PG_locked) on index
> 7680, and aio-dio-invalid is waiting for bit 15 (PG_writeback) on index
> 8148. aio-dio-invalid holds the lock on 7680, and the kworker's epd
> looks like the following:
>
> crash> struct extent_page_data ffffc900297bbbb0
> struct extent_page_data {
>   bio = 0xffff889f747ed830,
>   tree = 0xffff889eed6ba448,
>   extent_locked = 0,
>   sync_io = 0
> }
>
> Using drgn, I walked the bio pages looking for page
> 0xffffea00fbfc7500, which is the one we're waiting for writeback on:
>
> bio = Object(prog, 'struct bio', address=0xffff889f747ed830)
> for i in range(0, bio.bi_vcnt.value_()):
>     bv = bio.bi_io_vec[i]
>     if bv.bv_page.value_() == 0xffffea00fbfc7500:
>         print("FOUND IT")
>
> which validated what I suspected.
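
For anyone following along, the wait cycle described above can be written
down as a tiny wait-for graph. This is only an illustrative sketch in plain
Python, not drgn or kernel code. The `holders`/`waiters` tables and the
`find_cycle` helper are names made up for this example; the page indexes,
bits and task names are the ones from the report. The kworker is treated as
"holding" PG_writeback on 8148 because that page sits in its unsubmitted bio.

```python
def find_cycle(waits_for):
    """Return a list of tasks forming a wait cycle, or None.

    waits_for maps each blocked task to the task it is blocked on.
    (Hypothetical helper for illustration only.)
    """
    for start in waits_for:
        seen = []
        cur = start
        # Follow the chain of blocked tasks until it ends or repeats.
        while cur in waits_for and cur not in seen:
            seen.append(cur)
            cur = waits_for[cur]
        if cur in seen:
            return seen[seen.index(cur):]
    return None


# Who can release each (page index, bit), per the report above.
holders = {
    (7680, "PG_locked"): "aio-dio-invalid",
    (8148, "PG_writeback"): "kworker/u113:5",  # page is in its unsubmitted bio
}
# What each task is sleeping on.
waiters = {
    "kworker/u113:5": (7680, "PG_locked"),
    "aio-dio-invalid": (8148, "PG_writeback"),
}

waits_for = {task: holders[res] for task, res in waiters.items()}
cycle = find_cycle(waits_for)
print(cycle)
```

Each task waits on something only the other can release, so neither makes
progress.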
>
> The fix for this is simple: flush the epd before we loop back around to
> the beginning of the file during writeout.
>
> Fixes: b293f02e1423 ("Btrfs: Add writepages support")
> Signed-off-by: Josef Bacik <josef@xxxxxxxxxxxxxx>
Added to misc-next, thanks.
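
As a rough illustration of why flushing before the wrap-around helps, here is
a toy model of a cyclic writeback pass. It is a sketch in plain Python under
stated assumptions, not the kernel implementation: `lock_attempts` and its
`flush_before_wrap` flag are hypothetical names, and the starting index of
8000 is invented so that 8148 is visited before the pass wraps back to 7680.

```python
def lock_attempts(dirty, start, flush_before_wrap):
    """Toy model of a cyclic writeback pass (not kernel code).

    Returns (index, pending) pairs, where pending lists the page indexes
    sitting in the not-yet-submitted bio at the moment `index` is locked.
    """
    first_pass = sorted(i for i in dirty if i >= start)
    second_pass = sorted(i for i in dirty if i < start)
    bio = []
    attempts = []
    for index in first_pass:
        attempts.append((index, list(bio)))
        bio.append(index)
    if flush_before_wrap:
        bio.clear()  # models submitting the bio before looping around
    for index in second_pass:
        attempts.append((index, list(bio)))
        bio.append(index)
    return attempts


dirty = [7680, 8148]  # page indexes from the report
START = 8000          # hypothetical starting index for the cyclic pass

unfixed = dict(lock_attempts(dirty, START, flush_before_wrap=False))
fixed = dict(lock_attempts(dirty, START, flush_before_wrap=True))

print(unfixed[7680])  # [8148]: 8148 still pending while we block on 7680
print(fixed[7680])    # []: nothing pending, so the other writer can proceed
```

Without the flush, the pass blocks on 7680 while still holding 8148 in its
bio, which is the state seen in the traces. With the flush, 8148 has already
been submitted by the time 7680 is locked.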