On Tue, May 26, 2015 at 12:55:42AM +0100, fdmanana@xxxxxxxxxx wrote:
> From: Filipe Manana <fdmanana@xxxxxxxx>
>
> Zygo Blaxell and other users have reported occasional hangs while an
> inode is being evicted, leading to traces like the following:
>
> [ 5281.972322] INFO: task rm:20488 blocked for more than 120 seconds.
> [ 5281.973836] Not tainted 4.0.0-rc5-btrfs-next-9+ #2
> [ 5281.974818] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 5281.976364] rm D ffff8800724cfc38 0 20488 7747 0x00000000
> [ 5281.977506] ffff8800724cfc38 ffff8800724cfc38 ffff880065da5c50 0000000000000001
> [ 5281.978461] ffff8800724cffd8 ffff8801540a5f50 0000000000000008 ffff8801540a5f78
> [ 5281.979541] ffff8801540a5f50 ffff8800724cfc58 ffffffff8143107e 0000000000000123
> [ 5281.981396] Call Trace:
> [ 5281.982066] [<ffffffff8143107e>] schedule+0x74/0x83
> [ 5281.983341] [<ffffffffa03b33cf>] wait_on_state+0xac/0xcd [btrfs]
> [ 5281.985127] [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
> [ 5281.986715] [<ffffffffa03b4b71>] wait_extent_bit.constprop.32+0x7c/0xde [btrfs]
> [ 5281.988680] [<ffffffffa03b540b>] lock_extent_bits+0x5d/0x88 [btrfs]
> [ 5281.990200] [<ffffffffa03a621d>] btrfs_evict_inode+0x24e/0x5be [btrfs]
> [ 5281.991781] [<ffffffff8116964d>] evict+0xa0/0x148
> [ 5281.992735] [<ffffffff8116a43d>] iput+0x18f/0x1e5
> [ 5281.993796] [<ffffffff81160d4a>] do_unlinkat+0x15b/0x1fa
> [ 5281.994806] [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58
> [ 5281.996120] [<ffffffff8107d314>] ? trace_hardirqs_on_caller+0x18f/0x1ab
> [ 5281.997562] [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [ 5281.998815] [<ffffffff81161a16>] SyS_unlinkat+0x29/0x2b
> [ 5281.999920] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
> [ 5282.001299] 1 lock held by rm/20488:
> [ 5282.002066] #0: (sb_writers#12){.+.+.+}, at: [<ffffffff8116dd81>] mnt_want_write+0x24/0x4b
>
> This happens when we have readahead, which calls readpages(), happening
> right before the inode eviction handler is invoked. So the reason is
> essentially:
>
> 1) readpages() is called while a reference on the inode is held, so
> eviction can not be triggered before readpages() returns. It also
> locks one or more ranges in the inode's io_tree (which is done at
> extent_io.c:__do_contiguous_readpages());
>
> 2) readpages() submits several read bios, all with an end io callback
> that runs extent_io.c:end_bio_extent_readpage() and that is executed
> by other task when a bio finishes, corresponding to a work queue
> (fs_info->end_io_workers) worker kthread. This callback unlocks
> the ranges in the inode's io_tree that were previously locked in
> step 1;
>
> 3) readpages() returns, the reference on the inode is dropped;
>
> 4) One or more of the read bios previously submitted are still not
> complete (their end io callback was not yet invoked or has not
> yet finished execution);
>
> 5) Inode eviction is triggered (through an unlink call for example).
> The inode reference count was not incremented before submitting
> the read bios, therefore this is possible;
>
> 6) The eviction handler starts executing and enters the loop that
> iterates over all extent states in the inode's io_tree;
One question here:
btrfs_evict_inode()->
evict_inode_truncate_pages()->
truncate_inode_pages_final()
truncate_inode_pages_final() finds all pages in the mapping and
lock_page() on each page, if there is a readpage running by other tasks,
it should be blocked on lock_page() until end_bio_extent_readpage()
does a unlock_page().
How does eviction handler bypasses this and enters 'extent_state' loop?
Am I missing something?
Thanks,
-liubo
>
> 7) The loop picks one extent state record and uses its ->start and
> ->end fields, after releasing the inode's io_tree spinlock, to
> call lock_extent_bits() and clear_extent_bit(). The call to lock
> the range [state->start, state->end] blocks because the whole
> range or a part of it was locked by the previous call to
> readpages() and the corresponding end io callback, which unlocks
> the range was not yet executed;
>
> 8) The end io callback for the read bio is executed and unlocks the
> range [state->start, state->end] (or a superset of that range).
> And at clear_extent_bit() the extent_state record state is used
> as a second argument to split_state(), which sets state->start to
> a larger value;
>
> 9) The task executing the eviction handler is woken up by the task
> executing the bio's end io callback (through clear_state_bit) and
> the eviction handler locks the range
> [old value for state->start, state->end]. Shortly after, when
> calling clear_extent_bit(), it unlocks the range
> [new value for state->start, state->end], so it ends up unlocking
> only part of the range that it locked, leaving an extent state
> record in the io_tree that represents the unlocked subrange;
>
> 10) The eviction handler loop, in its next iteration, gets the
> extent_state record for the subrange that it did not unlock in the
> previous step and then tries to lock it, resulting in an hang.
>
> So fix this by not using the ->start and ->end fields of an existing
> extent_state record. This is a simple solution, and an alternative
> could be to bump the inode's reference count before submitting each
> read bio and having it dropped in the bio's end io callback. But that
> would be a more invasive/complex change and would not protect against
> other possible places that are not holding a reference on the inode
> as well. Something to consider in the future.
>
> Many thanks to Zygo Blaxell for reporting, in the mailing list, the
> issue, a set of scripts to trigger it and testing this fix.
>
> Reported-by: Zygo Blaxell <ce3g8jdj@xxxxxxxxxxxxxxxxxxxxx>
> Tested-by: Zygo Blaxell <ce3g8jdj@xxxxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Filipe Manana <fdmanana@xxxxxxxx>
> ---
> fs/btrfs/inode.c | 26 +++++++++++++++++++++-----
> 1 file changed, 21 insertions(+), 5 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 0020b56..ef05800 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -4987,24 +4987,40 @@ static void evict_inode_truncate_pages(struct inode *inode)
> }
> write_unlock(&map_tree->lock);
>
> + /*
> + * Keep looping until we have no more ranges in the io tree.
> + * We can have ongoing bios started by readpages (called from readahead)
> + * that didn't get their end io callbacks called yet or they are still
> + * in progress ((extent_io.c:end_bio_extent_readpage()). This means some
> + * ranges can still be locked and eviction started because before
> + * submitting those bios, which are executed by a separate task (work
> + * queue kthread), inode references (inode->i_count) were not taken
> + * (which would be dropped in the end io callback of each bio).
> + * Therefore here we effectively end up waiting for those bios and
> + * anyone else holding locked ranges without having bumped the inode's
> + * reference count - if we don't do it, when they access the inode's
> + * io_tree to unlock a range it may be too late, leading to an
> + * use-after-free issue.
> + */
> spin_lock(&io_tree->lock);
> while (!RB_EMPTY_ROOT(&io_tree->state)) {
> struct extent_state *state;
> struct extent_state *cached_state = NULL;
> + u64 start;
> + u64 end;
>
> node = rb_first(&io_tree->state);
> state = rb_entry(node, struct extent_state, rb_node);
> - atomic_inc(&state->refs);
> + start = state->start;
> + end = state->end;
> spin_unlock(&io_tree->lock);
>
> - lock_extent_bits(io_tree, state->start, state->end,
> - 0, &cached_state);
> - clear_extent_bit(io_tree, state->start, state->end,
> + lock_extent_bits(io_tree, start, end, 0, &cached_state);
> + clear_extent_bit(io_tree, start, end,
> EXTENT_LOCKED | EXTENT_DIRTY |
> EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING |
> EXTENT_DEFRAG, 1, 1,
> &cached_state, GFP_NOFS);
> - free_extent_state(state);
>
> cond_resched();
> spin_lock(&io_tree->lock);
> --
> 2.1.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html