On Wed, Jun 10, 2015 at 11:41 AM, Liu Bo <bo.li.liu@xxxxxxxxxx> wrote:
> On Tue, May 26, 2015 at 12:55:42AM +0100, fdmanana@xxxxxxxxxx wrote:
>> From: Filipe Manana <fdmanana@xxxxxxxx>
>>
>> Zygo Blaxell and other users have reported occasional hangs while an
>> inode is being evicted, leading to traces like the following:
>>
>> [ 5281.972322] INFO: task rm:20488 blocked for more than 120 seconds.
>> [ 5281.973836] Not tainted 4.0.0-rc5-btrfs-next-9+ #2
>> [ 5281.974818] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [ 5281.976364] rm D ffff8800724cfc38 0 20488 7747 0x00000000
>> [ 5281.977506] ffff8800724cfc38 ffff8800724cfc38 ffff880065da5c50 0000000000000001
>> [ 5281.978461] ffff8800724cffd8 ffff8801540a5f50 0000000000000008 ffff8801540a5f78
>> [ 5281.979541] ffff8801540a5f50 ffff8800724cfc58 ffffffff8143107e 0000000000000123
>> [ 5281.981396] Call Trace:
>> [ 5281.982066] [<ffffffff8143107e>] schedule+0x74/0x83
>> [ 5281.983341] [<ffffffffa03b33cf>] wait_on_state+0xac/0xcd [btrfs]
>> [ 5281.985127] [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
>> [ 5281.986715] [<ffffffffa03b4b71>] wait_extent_bit.constprop.32+0x7c/0xde [btrfs]
>> [ 5281.988680] [<ffffffffa03b540b>] lock_extent_bits+0x5d/0x88 [btrfs]
>> [ 5281.990200] [<ffffffffa03a621d>] btrfs_evict_inode+0x24e/0x5be [btrfs]
>> [ 5281.991781] [<ffffffff8116964d>] evict+0xa0/0x148
>> [ 5281.992735] [<ffffffff8116a43d>] iput+0x18f/0x1e5
>> [ 5281.993796] [<ffffffff81160d4a>] do_unlinkat+0x15b/0x1fa
>> [ 5281.994806] [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58
>> [ 5281.996120] [<ffffffff8107d314>] ? trace_hardirqs_on_caller+0x18f/0x1ab
>> [ 5281.997562] [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>> [ 5281.998815] [<ffffffff81161a16>] SyS_unlinkat+0x29/0x2b
>> [ 5281.999920] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
>> [ 5282.001299] 1 lock held by rm/20488:
>> [ 5282.002066] #0: (sb_writers#12){.+.+.+}, at: [<ffffffff8116dd81>] mnt_want_write+0x24/0x4b
>>
>> This happens when we have readahead, which calls readpages(), happening
>> right before the inode eviction handler is invoked. So the reason is
>> essentially:
>>
>> 1) readpages() is called while a reference on the inode is held, so
>> eviction can not be triggered before readpages() returns. It also
>> locks one or more ranges in the inode's io_tree (which is done at
>> extent_io.c:__do_contiguous_readpages());
>>
>> 2) readpages() submits several read bios, all with an end io callback
>> that runs extent_io.c:end_bio_extent_readpage() and that is executed
>> by other task when a bio finishes, corresponding to a work queue
>> (fs_info->end_io_workers) worker kthread. This callback unlocks
>> the ranges in the inode's io_tree that were previously locked in
>> step 1;
>>
>> 3) readpages() returns, the reference on the inode is dropped;
>>
>> 4) One or more of the read bios previously submitted are still not
>> complete (their end io callback was not yet invoked or has not
>> yet finished execution);
>>
>> 5) Inode eviction is triggered (through an unlink call for example).
>> The inode reference count was not incremented before submitting
>> the read bios, therefore this is possible;
>>
>> 6) The eviction handler starts executing and enters the loop that
>> iterates over all extent states in the inode's io_tree;
>
> One question here:
> btrfs_evict_inode()->
> evict_inode_truncate_pages()->
> truncate_inode_pages_final()
>
> truncate_inode_pages_final() finds all pages in the mapping and
> lock_page() on each page, if there is a readpage running by other tasks,
> it should be blocked on lock_page() until end_bio_extent_readpage()
> does a unlock_page().
>
> How does eviction handler bypasses this and enters 'extent_state' loop?
>
> Am I missing something?
Yes, you are missing something.
end_bio_extent_readpage() unlocks the pages and only after it does
unlock the ranges in the inode's io tree.
thanks
>
> Thanks,
>
> -liubo
>>
>> 7) The loop picks one extent state record and uses its ->start and
>> ->end fields, after releasing the inode's io_tree spinlock, to
>> call lock_extent_bits() and clear_extent_bit(). The call to lock
>> the range [state->start, state->end] blocks because the whole
>> range or a part of it was locked by the previous call to
>> readpages() and the corresponding end io callback, which unlocks
>> the range was not yet executed;
>>
>> 8) The end io callback for the read bio is executed and unlocks the
>> range [state->start, state->end] (or a superset of that range).
>> And at clear_extent_bit() the extent_state record state is used
>> as a second argument to split_state(), which sets state->start to
>> a larger value;
>>
>> 9) The task executing the eviction handler is woken up by the task
>> executing the bio's end io callback (through clear_state_bit) and
>> the eviction handler locks the range
>> [old value for state->start, state->end]. Shortly after, when
>> calling clear_extent_bit(), it unlocks the range
>> [new value for state->start, state->end], so it ends up unlocking
>> only part of the range that it locked, leaving an extent state
>> record in the io_tree that represents the unlocked subrange;
>>
>> 10) The eviction handler loop, in its next iteration, gets the
>> extent_state record for the subrange that it did not unlock in the
>> previous step and then tries to lock it, resulting in an hang.
>>
>> So fix this by not using the ->start and ->end fields of an existing
>> extent_state record. This is a simple solution, and an alternative
>> could be to bump the inode's reference count before submitting each
>> read bio and having it dropped in the bio's end io callback. But that
>> would be a more invasive/complex change and would not protect against
>> other possible places that are not holding a reference on the inode
>> as well. Something to consider in the future.
>>
>> Many thanks to Zygo Blaxell for reporting, in the mailing list, the
>> issue, a set of scripts to trigger it and testing this fix.
>>
>> Reported-by: Zygo Blaxell <ce3g8jdj@xxxxxxxxxxxxxxxxxxxxx>
>> Tested-by: Zygo Blaxell <ce3g8jdj@xxxxxxxxxxxxxxxxxxxxx>
>> Signed-off-by: Filipe Manana <fdmanana@xxxxxxxx>
>> ---
>> fs/btrfs/inode.c | 26 +++++++++++++++++++++-----
>> 1 file changed, 21 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 0020b56..ef05800 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -4987,24 +4987,40 @@ static void evict_inode_truncate_pages(struct inode *inode)
>> }
>> write_unlock(&map_tree->lock);
>>
>> + /*
>> + * Keep looping until we have no more ranges in the io tree.
>> + * We can have ongoing bios started by readpages (called from readahead)
>> + * that didn't get their end io callbacks called yet or they are still
>> + * in progress ((extent_io.c:end_bio_extent_readpage()). This means some
>> + * ranges can still be locked and eviction started because before
>> + * submitting those bios, which are executed by a separate task (work
>> + * queue kthread), inode references (inode->i_count) were not taken
>> + * (which would be dropped in the end io callback of each bio).
>> + * Therefore here we effectively end up waiting for those bios and
>> + * anyone else holding locked ranges without having bumped the inode's
>> + * reference count - if we don't do it, when they access the inode's
>> + * io_tree to unlock a range it may be too late, leading to an
>> + * use-after-free issue.
>> + */
>> spin_lock(&io_tree->lock);
>> while (!RB_EMPTY_ROOT(&io_tree->state)) {
>> struct extent_state *state;
>> struct extent_state *cached_state = NULL;
>> + u64 start;
>> + u64 end;
>>
>> node = rb_first(&io_tree->state);
>> state = rb_entry(node, struct extent_state, rb_node);
>> - atomic_inc(&state->refs);
>> + start = state->start;
>> + end = state->end;
>> spin_unlock(&io_tree->lock);
>>
>> - lock_extent_bits(io_tree, state->start, state->end,
>> - 0, &cached_state);
>> - clear_extent_bit(io_tree, state->start, state->end,
>> + lock_extent_bits(io_tree, start, end, 0, &cached_state);
>> + clear_extent_bit(io_tree, start, end,
>> EXTENT_LOCKED | EXTENT_DIRTY |
>> EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING |
>> EXTENT_DEFRAG, 1, 1,
>> &cached_state, GFP_NOFS);
>> - free_extent_state(state);
>>
>> cond_resched();
>> spin_lock(&io_tree->lock);
>> --
>> 2.1.3
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html