On Mon, Jul 29, 2019 at 09:37:10AM +0100, fdmanana@xxxxxxxxxx wrote:
> From: Filipe Manana <fdmanana@xxxxxxxx>
>
> The fiemap handler locks a file range that can have unflushed delalloc,
> and after locking the range, it tries to attach to a running transaction.
> If the running transaction started its commit, that is, it is in state
> TRANS_STATE_COMMIT_START, and either the filesystem was mounted with the
> flushoncommit option or the transaction is creating a snapshot for the
> subvolume that contains the file that fiemap is operating on, we end up
> deadlocking. This happens because fiemap is blocked on the transaction,
> waiting for it to complete, and the transaction is waiting for the flushed
> delalloc to complete, which requires locking the file range that the fiemap
> task already locked. The following stack traces serve as an example of
> when this deadlock happens:
>
> (...)
> [404571.515510] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
> [404571.515956] Call Trace:
> [404571.516360] ? __schedule+0x3ae/0x7b0
> [404571.516730] schedule+0x3a/0xb0
> [404571.517104] lock_extent_bits+0x1ec/0x2a0 [btrfs]
> [404571.517465] ? remove_wait_queue+0x60/0x60
> [404571.517832] btrfs_finish_ordered_io+0x292/0x800 [btrfs]
> [404571.518202] normal_work_helper+0xea/0x530 [btrfs]
> [404571.518566] process_one_work+0x21e/0x5c0
> [404571.518990] worker_thread+0x4f/0x3b0
> [404571.519413] ? process_one_work+0x5c0/0x5c0
> [404571.519829] kthread+0x103/0x140
> [404571.520191] ? kthread_create_worker_on_cpu+0x70/0x70
> [404571.520565] ret_from_fork+0x3a/0x50
> [404571.520915] kworker/u8:6 D 0 31651 2 0x80004000
> [404571.521290] Workqueue: btrfs-flush_delalloc btrfs_flush_delalloc_helper [btrfs]
> (...)
> [404571.537000] fsstress D 0 13117 13115 0x00004000
> [404571.537263] Call Trace:
> [404571.537524] ? __schedule+0x3ae/0x7b0
> [404571.537788] schedule+0x3a/0xb0
> [404571.538066] wait_current_trans+0xc8/0x100 [btrfs]
> [404571.538349] ? remove_wait_queue+0x60/0x60
> [404571.538680] start_transaction+0x33c/0x500 [btrfs]
> [404571.539076] btrfs_check_shared+0xa3/0x1f0 [btrfs]
> [404571.539513] ? extent_fiemap+0x2ce/0x650 [btrfs]
> [404571.539866] extent_fiemap+0x2ce/0x650 [btrfs]
> [404571.540170] do_vfs_ioctl+0x526/0x6f0
> [404571.540436] ksys_ioctl+0x70/0x80
> [404571.540734] __x64_sys_ioctl+0x16/0x20
> [404571.540997] do_syscall_64+0x60/0x1d0
> [404571.541279] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> (...)
> [404571.543729] btrfs D 0 14210 14208 0x00004000
> [404571.544023] Call Trace:
> [404571.544275] ? __schedule+0x3ae/0x7b0
> [404571.544526] ? wait_for_completion+0x112/0x1a0
> [404571.544795] schedule+0x3a/0xb0
> [404571.545064] schedule_timeout+0x1ff/0x390
> [404571.545351] ? lock_acquire+0xa6/0x190
> [404571.545638] ? wait_for_completion+0x49/0x1a0
> [404571.545890] ? wait_for_completion+0x112/0x1a0
> [404571.546228] wait_for_completion+0x131/0x1a0
> [404571.546503] ? wake_up_q+0x70/0x70
> [404571.546775] btrfs_wait_ordered_extents+0x27c/0x400 [btrfs]
> [404571.547159] btrfs_commit_transaction+0x3b0/0xae0 [btrfs]
> [404571.547449] ? btrfs_mksubvol+0x4a4/0x640 [btrfs]
> [404571.547703] ? remove_wait_queue+0x60/0x60
> [404571.547969] btrfs_mksubvol+0x605/0x640 [btrfs]
> [404571.548226] ? __sb_start_write+0xd4/0x1c0
> [404571.548512] ? mnt_want_write_file+0x24/0x50
> [404571.548789] btrfs_ioctl_snap_create_transid+0x169/0x1a0 [btrfs]
> [404571.549048] btrfs_ioctl_snap_create_v2+0x11d/0x170 [btrfs]
> [404571.549307] btrfs_ioctl+0x133f/0x3150 [btrfs]
> [404571.549549] ? mem_cgroup_charge_statistics+0x4c/0xd0
> [404571.549792] ? mem_cgroup_commit_charge+0x84/0x4b0
> [404571.550064] ? __handle_mm_fault+0xe3e/0x11f0
> [404571.550306] ? do_raw_spin_unlock+0x49/0xc0
> [404571.550608] ? _raw_spin_unlock+0x24/0x30
> [404571.550976] ? __handle_mm_fault+0xedf/0x11f0
> [404571.551319] ? do_vfs_ioctl+0xa2/0x6f0
> [404571.551659] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
> [404571.552087] do_vfs_ioctl+0xa2/0x6f0
> [404571.552355] ksys_ioctl+0x70/0x80
> [404571.552621] __x64_sys_ioctl+0x16/0x20
> [404571.552864] do_syscall_64+0x60/0x1d0
> [404571.553104] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> (...)
>
> If we were joining the transaction instead of attaching to it, we would
> not risk a deadlock because a join only blocks if the transaction is in a
> state greater than or equal to TRANS_STATE_COMMIT_DOING, and the delalloc
> flush performed by a transaction is done before it reaches that state,
> when it is in the state TRANS_STATE_COMMIT_START. However, a transaction
> join is intended for use cases where we do modify the filesystem, and
> fiemap only needs to peek at delayed references from the current
> transaction in order to determine if extents are shared, and, besides
> that, when there is no current transaction or when it blocks to wait for
> a current committing transaction to complete, it creates a new transaction
> without reserving any space. Such unnecessary transactions, besides doing
> unnecessary IO, can cause transaction aborts (-ENOSPC) and unnecessary
> rotation of the precious backup roots.
>
> So fix this by adding a new transaction join variant, named join_nostart,
> which behaves like the regular join, but it does not create a transaction
> when none currently exists or after waiting for a committing transaction
> to complete.
>
> Fixes: 03628cdbc64db6 ("Btrfs: do not start a transaction during fiemap")
> Signed-off-by: Filipe Manana <fdmanana@xxxxxxxx>
Queued for 5.3, thanks.
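
For readers following along, the blocking rules described in the changelog
can be sketched as a small userspace model. This is only an illustration
under stated assumptions: the enum and function names below are hypothetical
and are not the kernel's definitions, the real state machine has more states
than the three shown, and the assumption that an attach never starts a new
transaction comes from the title of the commit named in the Fixes tag rather
than from code in this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced set of transaction states, in commit order. */
enum trans_state {
	STATE_RUNNING,
	STATE_COMMIT_START,	/* the delalloc flush is done in this state */
	STATE_COMMIT_DOING,
};

/* Hypothetical handle types, named after the variants in the changelog. */
enum handle_type {
	HANDLE_ATTACH,
	HANDLE_JOIN,
	HANDLE_JOIN_NOSTART,
};

/* Would a caller of this type wait on a transaction in this state? */
static inline bool blocks_on(enum handle_type type, enum trans_state state)
{
	/* Attach already waits once the commit has started. */
	if (type == HANDLE_ATTACH)
		return state >= STATE_COMMIT_START;
	/* Both join variants wait only from COMMIT_DOING onwards. */
	return state >= STATE_COMMIT_DOING;
}

/* May this handle type create a transaction when none is running? */
static inline bool may_start_transaction(enum handle_type type)
{
	/* Only the regular join does; attach is assumed not to. */
	return type == HANDLE_JOIN;
}

/*
 * A task that holds a file range lock can deadlock with the commit's
 * delalloc flush only if it would wait while the flush is in progress,
 * that is, while the transaction is in COMMIT_START.
 */
static inline bool can_deadlock_with_flush(enum handle_type type)
{
	return blocks_on(type, STATE_COMMIT_START);
}

/* Small self-check that restates the changelog's argument. */
static inline void model_self_check(void)
{
	assert(can_deadlock_with_flush(HANDLE_ATTACH));
	assert(!can_deadlock_with_flush(HANDLE_JOIN));
	assert(!can_deadlock_with_flush(HANDLE_JOIN_NOSTART));
	assert(!may_start_transaction(HANDLE_JOIN_NOSTART));
}
```

In this model, join_nostart is the only variant that neither waits during
the flush nor creates a transaction, which is the combination fiemap needs
while it holds the range lock.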