Re: [PATCH] Btrfs: fix hang during inode eviction due to concurrent readahead

On Wed, Jun 10, 2015 at 12:27 PM, Liu Bo <bo.li.liu@xxxxxxxxxx> wrote:
> On Wed, Jun 10, 2015 at 12:12:11PM +0100, Filipe Manana wrote:
>> On Wed, Jun 10, 2015 at 12:05 PM, Liu Bo <bo.li.liu@xxxxxxxxxx> wrote:
>> > On Wed, Jun 10, 2015 at 11:44:17AM +0100, Filipe Manana wrote:
>> >> On Wed, Jun 10, 2015 at 11:41 AM, Liu Bo <bo.li.liu@xxxxxxxxxx> wrote:
>> >> > On Tue, May 26, 2015 at 12:55:42AM +0100, fdmanana@xxxxxxxxxx wrote:
>> >> >> From: Filipe Manana <fdmanana@xxxxxxxx>
>> >> >>
>> >> >> Zygo Blaxell and other users have reported occasional hangs while an
>> >> >> inode is being evicted, leading to traces like the following:
>> >> >>
>> >> >> [ 5281.972322] INFO: task rm:20488 blocked for more than 120 seconds.
>> >> >> [ 5281.973836]       Not tainted 4.0.0-rc5-btrfs-next-9+ #2
>> >> >> [ 5281.974818] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> >> >> [ 5281.976364] rm              D ffff8800724cfc38     0 20488   7747 0x00000000
>> >> >> [ 5281.977506]  ffff8800724cfc38 ffff8800724cfc38 ffff880065da5c50 0000000000000001
>> >> >> [ 5281.978461]  ffff8800724cffd8 ffff8801540a5f50 0000000000000008 ffff8801540a5f78
>> >> >> [ 5281.979541]  ffff8801540a5f50 ffff8800724cfc58 ffffffff8143107e 0000000000000123
>> >> >> [ 5281.981396] Call Trace:
>> >> >> [ 5281.982066]  [<ffffffff8143107e>] schedule+0x74/0x83
>> >> >> [ 5281.983341]  [<ffffffffa03b33cf>] wait_on_state+0xac/0xcd [btrfs]
>> >> >> [ 5281.985127]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
>> >> >> [ 5281.986715]  [<ffffffffa03b4b71>] wait_extent_bit.constprop.32+0x7c/0xde [btrfs]
>> >> >> [ 5281.988680]  [<ffffffffa03b540b>] lock_extent_bits+0x5d/0x88 [btrfs]
>> >> >> [ 5281.990200]  [<ffffffffa03a621d>] btrfs_evict_inode+0x24e/0x5be [btrfs]
>> >> >> [ 5281.991781]  [<ffffffff8116964d>] evict+0xa0/0x148
>> >> >> [ 5281.992735]  [<ffffffff8116a43d>] iput+0x18f/0x1e5
>> >> >> [ 5281.993796]  [<ffffffff81160d4a>] do_unlinkat+0x15b/0x1fa
>> >> >> [ 5281.994806]  [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58
>> >> >> [ 5281.996120]  [<ffffffff8107d314>] ? trace_hardirqs_on_caller+0x18f/0x1ab
>> >> >> [ 5281.997562]  [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>> >> >> [ 5281.998815]  [<ffffffff81161a16>] SyS_unlinkat+0x29/0x2b
>> >> >> [ 5281.999920]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
>> >> >> [ 5282.001299] 1 lock held by rm/20488:
>> >> >> [ 5282.002066]  #0:  (sb_writers#12){.+.+.+}, at: [<ffffffff8116dd81>] mnt_want_write+0x24/0x4b
>> >> >>
>> >> >> This happens when readahead, which calls readpages(), happens right
>> >> >> before the inode eviction handler is invoked. So the sequence of
>> >> >> events is essentially:
>> >> >>
>> >> >> 1) readpages() is called while a reference on the inode is held, so
>> >> >>    eviction can not be triggered before readpages() returns. It also
>> >> >>    locks one or more ranges in the inode's io_tree (which is done at
>> >> >>    extent_io.c:__do_contiguous_readpages());
>> >> >>
>> >> >> 2) readpages() submits several read bios, all with an end io callback
>> >> >>    that runs extent_io.c:end_bio_extent_readpage() and that is executed
>> >> >>    by another task when a bio finishes, namely a work queue
>> >> >>    (fs_info->end_io_workers) worker kthread. This callback unlocks
>> >> >>    the ranges in the inode's io_tree that were previously locked in
>> >> >>    step 1;
>> >> >>
>> >> >> 3) readpages() returns, the reference on the inode is dropped;
>> >> >>
>> >> >> 4) One or more of the read bios previously submitted are still not
>> >> >>    complete (their end io callback was not yet invoked or has not
>> >> >>    yet finished execution);
>> >> >>
>> >> >> 5) Inode eviction is triggered (through an unlink call for example).
>> >> >>    The inode reference count was not incremented before submitting
>> >> >>    the read bios, therefore this is possible;
>> >> >>
>> >> >> 6) The eviction handler starts executing and enters the loop that
>> >> >>    iterates over all extent states in the inode's io_tree;
>> >> >
>> >> > One question here:
>> >> >  btrfs_evict_inode()->
>> >> >         evict_inode_truncate_pages()->
>> >> >                 truncate_inode_pages_final()
>> >> >
>> >> > truncate_inode_pages_final() finds all pages in the mapping and
>> >> > lock_page() on each page; if there is a readpage running in another task,
>> >> > it should be blocked on lock_page() until end_bio_extent_readpage()
>> >> > does an unlock_page().
>> >> >
>> >> > How does the eviction handler bypass this and enter the 'extent_state' loop?
>> >> >
>> >> > Am I missing something?
>> >>
>> >> Yes, you are missing something.
>> >> end_bio_extent_readpage() unlocks the pages and only after that does it
>> >> unlock the ranges in the inode's io tree.
>> >
>> > ah, that's right. Another thing is that
>> > truncate_inode_pages_final() takes every page's lock, which means that
>> > all the endios must have been processed, but the comment below in the
>> > code says some bios may still not have had their endio called.
>>
>> truncate_inode_pages_final() can lock the pages after
>> end_bio_extent_readpage() unlocks them and before
>> end_bio_extent_readpage() unlocks the ranges in the io tree. I know
>> there's some wrapping around that makes the bio's original endio
>> function be run by a work queue (fs_info->endio_workers) task plus
>> one or two more levels of indirection. But by endio I really meant
>> end_bio_extent_readpage(), as I mention in the comment between
>> parentheses, and not any of the indirections added by
>> btrfs_submit_bio_hook().
>
>
> Excerpted from the below comment,
>
>> >> >> +      * We can have ongoing bios started by readpages (called from readahead)
>> >> >> +      * that didn't get their end io callbacks called yet or they are still
>> >> >> +      * in progress (extent_io.c:end_bio_extent_readpage()).
>
> I mean that by the time truncate_inode_pages_final() ends, all the end_bio_extent_readpage()
> calls have already been made, since the pages need to be unlocked for evict() to go on.

When you say "have already been called" I understood completely called,
not still in progress.
Yes, by that time the only possibility is that they are still in progress.
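
To make the ordering explicit, the tail of end_bio_extent_readpage() does
roughly the following (a simplified sketch of just the ordering that matters
here, not the exact upstream code; "tree", "range_start" and "range_end" are
placeholders for the io_tree and the range covered by the bio):

static void end_bio_extent_readpage(struct bio *bio, int err)
{
	struct bio_vec *bvec;
	int i;

	bio_for_each_segment_all(bvec, bio, i) {
		struct page *page = bvec->bv_page;

		/* ... set the page up to date or record the error ... */

		/*
		 * 1) The page lock is released first. From this point on
		 *    truncate_inode_pages_final() can lock the page, so the
		 *    eviction handler is free to proceed.
		 */
		unlock_page(page);
	}

	/*
	 * 2) Only after the pages were unlocked is the corresponding range
	 *    unlocked in the inode's io_tree (tree/range_start/range_end are
	 *    placeholders). The eviction handler can run between 1) and 2)
	 *    and still find the range locked.
	 */
	unlock_extent(tree, range_start, range_end);

	bio_put(bio);
}

So the race window is exactly between 1) and 2): the pages are already
unlocked, but the io_tree range is still locked.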
I'll send a patch to fix the comment since Chris has already picked this one up.

>
> Thanks,
>
> -liubo
>
>>
>> >
>> > Thanks,
>> >
>> > -liubo
>> >>
>> >> thanks
>> >>
>> >>
>> >> >
>> >> > Thanks,
>> >> >
>> >> > -liubo
>> >> >>
>> >> >> 7) The loop picks one extent state record and uses its ->start and
>> >> >>    ->end fields, after releasing the inode's io_tree spinlock, to
>> >> >>    call lock_extent_bits() and clear_extent_bit(). The call to lock
>> >> >>    the range [state->start, state->end] blocks because the whole
>> >> >>    range or a part of it was locked by the previous call to
>> >> >>    readpages() and the corresponding end io callback, which unlocks
>> >> >>    the range, was not yet executed;
>> >> >>
>> >> >> 8) The end io callback for the read bio is executed and unlocks the
>> >> >>    range [state->start, state->end] (or a superset of that range).
>> >> >>    And at clear_extent_bit() the extent_state record state is used
>> >> >>    as a second argument to split_state(), which sets state->start to
>> >> >>    a larger value;
>> >> >>
>> >> >> 9) The task executing the eviction handler is woken up by the task
>> >> >>    executing the bio's end io callback (through clear_state_bit) and
>> >> >>    the eviction handler locks the range
>> >> >>    [old value for state->start, state->end]. Shortly after, when
>> >> >>    calling clear_extent_bit(), it unlocks the range
>> >> >>    [new value for state->start, state->end], so it ends up unlocking
>> >> >>    only part of the range that it locked, leaving an extent state
>> >> >>    record in the io_tree that represents the unlocked subrange;
>> >> >>
>> >> >> 10) The eviction handler loop, in its next iteration, gets the
>> >> >>     extent_state record for the subrange that it did not unlock in the
>> >> >>     previous step and then tries to lock it, resulting in a hang.
>> >> >>
>> >> >> So fix this by not using the ->start and ->end fields of an existing
>> >> >> extent_state record. This is a simple solution, and an alternative
>> >> >> could be to bump the inode's reference count before submitting each
>> >> >> read bio and having it dropped in the bio's end io callback. But that
>> >> >> would be a more invasive/complex change and would not protect against
>> >> >> other possible places that are not holding a reference on the inode
>> >> >> as well. Something to consider in the future.
>> >> >>
>> >> >> Many thanks to Zygo Blaxell for reporting, in the mailing list, the
>> >> >> issue, a set of scripts to trigger it and testing this fix.
>> >> >>
>> >> >> Reported-by: Zygo Blaxell <ce3g8jdj@xxxxxxxxxxxxxxxxxxxxx>
>> >> >> Tested-by: Zygo Blaxell <ce3g8jdj@xxxxxxxxxxxxxxxxxxxxx>
>> >> >> Signed-off-by: Filipe Manana <fdmanana@xxxxxxxx>
>> >> >> ---
>> >> >>  fs/btrfs/inode.c | 26 +++++++++++++++++++++-----
>> >> >>  1 file changed, 21 insertions(+), 5 deletions(-)
>> >> >>
>> >> >> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> >> >> index 0020b56..ef05800 100644
>> >> >> --- a/fs/btrfs/inode.c
>> >> >> +++ b/fs/btrfs/inode.c
>> >> >> @@ -4987,24 +4987,40 @@ static void evict_inode_truncate_pages(struct inode *inode)
>> >> >>       }
>> >> >>       write_unlock(&map_tree->lock);
>> >> >>
>> >> >> +     /*
>> >> >> +      * Keep looping until we have no more ranges in the io tree.
>> >> >> +      * We can have ongoing bios started by readpages (called from readahead)
>> >> >> +      * that didn't get their end io callbacks called yet or they are still
>> >> >> +      * in progress (extent_io.c:end_bio_extent_readpage()). This means some
>> >> >> +      * ranges can still be locked and eviction started because before
>> >> >> +      * submitting those bios, which are executed by a separate task (work
>> >> >> +      * queue kthread), inode references (inode->i_count) were not taken
>> >> >> +      * (which would be dropped in the end io callback of each bio).
>> >> >> +      * Therefore here we effectively end up waiting for those bios and
>> >> >> +      * anyone else holding locked ranges without having bumped the inode's
>> >> >> +      * reference count - if we don't do it, when they access the inode's
>> >> >> +      * io_tree to unlock a range, it may be too late, leading to a
>> >> >> +      * use-after-free issue.
>> >> >> +      */
>> >> >>       spin_lock(&io_tree->lock);
>> >> >>       while (!RB_EMPTY_ROOT(&io_tree->state)) {
>> >> >>               struct extent_state *state;
>> >> >>               struct extent_state *cached_state = NULL;
>> >> >> +             u64 start;
>> >> >> +             u64 end;
>> >> >>
>> >> >>               node = rb_first(&io_tree->state);
>> >> >>               state = rb_entry(node, struct extent_state, rb_node);
>> >> >> -             atomic_inc(&state->refs);
>> >> >> +             start = state->start;
>> >> >> +             end = state->end;
>> >> >>               spin_unlock(&io_tree->lock);
>> >> >>
>> >> >> -             lock_extent_bits(io_tree, state->start, state->end,
>> >> >> -                              0, &cached_state);
>> >> >> -             clear_extent_bit(io_tree, state->start, state->end,
>> >> >> +             lock_extent_bits(io_tree, start, end, 0, &cached_state);
>> >> >> +             clear_extent_bit(io_tree, start, end,
>> >> >>                                EXTENT_LOCKED | EXTENT_DIRTY |
>> >> >>                                EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING |
>> >> >>                                EXTENT_DEFRAG, 1, 1,
>> >> >>                                &cached_state, GFP_NOFS);
>> >> >> -             free_extent_state(state);
>> >> >>
>> >> >>               cond_resched();
>> >> >>               spin_lock(&io_tree->lock);
>> >> >> --
>> >> >> 2.1.3
>> >> >>



