On Mon, Dec 16, 2013 at 11:57 AM, Liu Bo <bo.li.liu@xxxxxxxxxx> wrote:
> On Mon, Dec 16, 2013 at 11:48:08AM +0000, Filipe David Manana wrote:
>> On Mon, Dec 16, 2013 at 11:45 AM, Liu Bo <bo.li.liu@xxxxxxxxxx> wrote:
>> > On Mon, Dec 16, 2013 at 11:05:31AM +0000, Filipe David Manana wrote:
>> >> On Mon, Dec 16, 2013 at 9:27 AM, Liu Bo <bo.li.liu@xxxxxxxxxx> wrote:
>> >> > On Tue, Nov 19, 2013 at 10:29:35PM +0000, Filipe David Borba Manana wrote:
>> >> >> The inode eviction can be very slow, because during eviction we
>> >> >> tell the VFS to truncate all of the inode's pages. This results
>> >> >> in calls to btrfs_invalidatepage() which in turn does calls to
>> >> >> lock_extent_bits() and clear_extent_bit(). These calls result in
>> >> >> too many merges and splits of extent_state structures, which
>> >> >> consume a lot of time and cpu when the inode has many pages. In
>> >> >> some scenarios I have experienced umount times higher than 15
>> >> >> minutes, even when there's no pending IO (after a btrfs fs sync).
>> >> >>
>> >> >> A quick way to reproduce this issue:
>> >> >>
>> >> >> $ mkfs.btrfs -f /dev/sdb3
>> >> >> $ mount /dev/sdb3 /mnt/btrfs
>> >> >> $ cd /mnt/btrfs
>> >> >> $ sysbench --test=fileio --file-num=128 --file-total-size=16G \
>> >> >> --file-test-mode=seqwr --num-threads=128 \
>> >> >> --file-block-size=16384 --max-time=60 --max-requests=0 run
>> >> >> $ time btrfs fi sync .
>> >> >> FSSync '.'
>> >> >>
>> >> >> real 0m25.457s
>> >> >> user 0m0.000s
>> >> >> sys 0m0.092s
>> >> >> $ cd ..
>> >> >> $ time umount /mnt/btrfs
>> >> >>
>> >> >> real 1m38.234s
>> >> >> user 0m0.000s
>> >> >> sys 1m25.760s
>> >> >>
>> >> >
>> >> > What about the time of umount after 'sync'?
>> >>
>> >> Same huge difference.
>> >> Thanks.
>> >
>> > Not seeing that huge one with the latest btrfs, maybe because your memory is
>> > rather larger.
>>
>> Not sure if I understand you.
>> Latest btrfs-next has this change integrated. Was the test below with
>> it integrated? You would have to compare it with a build without this
>> change.
>
> I'm testing the script with Chris's upstream repo, not btrfs-next, and umount
> is normal.
>
> It's possible that some patches merged in btrfs-next make umount's latency longer
> than expected.
The umount example was just a simple way to show inode eviction was
taking a long time not waiting for or doing IO.
And yes, my test was performed on a machine with a large amount of ram
(32Gb) compared to that tests total file size.
thanks
>
> thanks,
> -liubo
>
>>
>> Thanks.
>>
>> >
>> > time sync
>> > FSSync '/mnt/btrfs'
>> >
>> > real 0m17.006s
>> > user 0m0.004s
>> > sys 0m0.056s
>> >
>> > time umount /mnt/btrfs
>> >
>> > real 0m0.910s
>> > user 0m0.003s
>> > sys 0m0.715s
>> >
>> > -liubo
>> >
>> >>
>> >> >
>> >> > The following ext4 uses sync while btrfs uses 'btrfs filesystem sync'.
>> >> >
>> >> > I don't think they are the same thing.
>> >> >
>> >> > -liubo
>> >> >
>> >> >> The same test on ext4 runs much faster:
>> >> >>
>> >> >> $ mkfs.ext4 /dev/sdb3
>> >> >> $ mount /dev/sdb3 /mnt/ext4
>> >> >> $ cd /mnt/ext4
>> >> >> $ sysbench --test=fileio --file-num=128 --file-total-size=16G \
>> >> >> --file-test-mode=seqwr --num-threads=128 \
>> >> >> --file-block-size=16384 --max-time=60 --max-requests=0 run
>> >> >> $ sync
>> >> >> $ cd ..
>> >> >> $ time umount /mnt/ext4
>> >> >>
>> >> >> real 0m3.626s
>> >> >> user 0m0.004s
>> >> >> sys 0m3.012s
>> >> >>
>> >> >> After this patch, the unmount (inode evictions) is much faster:
>> >> >>
>> >> >> $ mkfs.btrfs -f /dev/sdb3
>> >> >> $ mount /dev/sdb3 /mnt/btrfs
>> >> >> $ cd /mnt/btrfs
>> >> >> $ sysbench --test=fileio --file-num=128 --file-total-size=16G \
>> >> >> --file-test-mode=seqwr --num-threads=128 \
>> >> >> --file-block-size=16384 --max-time=60 --max-requests=0 run
>> >> >> $ time btrfs fi sync .
>> >> >> FSSync '.'
>> >> >>
>> >> >> real 0m26.774s
>> >> >> user 0m0.000s
>> >> >> sys 0m0.084s
>> >> >> $ cd ..
>> >> >> $ time umount /mnt/btrfs
>> >> >>
>> >> >> real 0m1.811s
>> >> >> user 0m0.000s
>> >> >> sys 0m1.564s
>> >> >
>> >> >>
>> >> >> Signed-off-by: Filipe David Borba Manana <fdmanana@xxxxxxxxx>
>> >> >> ---
>> >> >> fs/btrfs/inode.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++--------
>> >> >> 1 file changed, 84 insertions(+), 14 deletions(-)
>> >> >>
>> >> >> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> >> >> index 5a5de36..e889779 100644
>> >> >> --- a/fs/btrfs/inode.c
>> >> >> +++ b/fs/btrfs/inode.c
>> >> >> @@ -4488,6 +4488,62 @@ static int btrfs_setattr(struct dentry *dentry, struct iattr *attr)
>> >> >> return err;
>> >> >> }
>> >> >>
>> >> >> +/*
>> >> >> + * While truncating the inode pages during eviction, we get the VFS calling
>> >> >> + * btrfs_invalidatepage() against each page of the inode. This is slow because
>> >> >> + * the calls to btrfs_invalidatepage() result in a huge amount of calls to
>> >> >> + * lock_extent_bits() and clear_extent_bit(), which keep merging and splitting
>> >> >> + * extent_state structures over and over, wasting lots of time.
>> >> >> + *
>> >> >> + * Therefore if the inode is being evicted, let btrfs_invalidatepage() skip all
>> >> >> + * those expensive operations on a per page basis and do only the ordered io
>> >> >> + * finishing, while we release here the extent_map and extent_state structures,
>> >> >> + * without the excessive merging and splitting.
>> >> >> + */
>> >> >> +static void evict_inode_truncate_pages(struct inode *inode)
>> >> >> +{
>> >> >> + struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
>> >> >> + struct extent_map_tree *map_tree = &BTRFS_I(inode)->extent_tree;
>> >> >> + struct rb_node *node;
>> >> >> +
>> >> >> + ASSERT(inode->i_state & I_FREEING);
>> >> >> + truncate_inode_pages(&inode->i_data, 0);
>> >> >> +
>> >> >> + write_lock(&map_tree->lock);
>> >> >> + while (!RB_EMPTY_ROOT(&map_tree->map)) {
>> >> >> + struct extent_map *em;
>> >> >> +
>> >> >> + node = rb_first(&map_tree->map);
>> >> >> + em = rb_entry(node, struct extent_map, rb_node);
>> >> >> + remove_extent_mapping(map_tree, em);
>> >> >> + free_extent_map(em);
>> >> >> + }
>> >> >> + write_unlock(&map_tree->lock);
>> >> >> +
>> >> >> + spin_lock(&io_tree->lock);
>> >> >> + while (!RB_EMPTY_ROOT(&io_tree->state)) {
>> >> >> + struct extent_state *state;
>> >> >> + struct extent_state *cached_state = NULL;
>> >> >> +
>> >> >> + node = rb_first(&io_tree->state);
>> >> >> + state = rb_entry(node, struct extent_state, rb_node);
>> >> >> + atomic_inc(&state->refs);
>> >> >> + spin_unlock(&io_tree->lock);
>> >> >> +
>> >> >> + lock_extent_bits(io_tree, state->start, state->end,
>> >> >> + 0, &cached_state);
>> >> >> + clear_extent_bit(io_tree, state->start, state->end,
>> >> >> + EXTENT_LOCKED | EXTENT_DIRTY |
>> >> >> + EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING |
>> >> >> + EXTENT_DEFRAG, 1, 1,
>> >> >> + &cached_state, GFP_NOFS);
>> >> >> + free_extent_state(state);
>> >> >> +
>> >> >> + spin_lock(&io_tree->lock);
>> >> >> + }
>> >> >> + spin_unlock(&io_tree->lock);
>> >> >> +}
>> >> >> +
>> >> >> void btrfs_evict_inode(struct inode *inode)
>> >> >> {
>> >> >> struct btrfs_trans_handle *trans;
>> >> >> @@ -4498,7 +4554,8 @@ void btrfs_evict_inode(struct inode *inode)
>> >> >>
>> >> >> trace_btrfs_inode_evict(inode);
>> >> >>
>> >> >> - truncate_inode_pages(&inode->i_data, 0);
>> >> >> + evict_inode_truncate_pages(inode);
>> >> >> +
>> >> >> if (inode->i_nlink &&
>> >> >> ((btrfs_root_refs(&root->root_item) != 0 &&
>> >> >> root->root_key.objectid != BTRFS_ROOT_TREE_OBJECTID) ||
>> >> >> @@ -7379,6 +7436,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
>> >> >> struct extent_state *cached_state = NULL;
>> >> >> u64 page_start = page_offset(page);
>> >> >> u64 page_end = page_start + PAGE_CACHE_SIZE - 1;
>> >> >> + int inode_evicting = inode->i_state & I_FREEING;
>> >> >>
>> >> >> /*
>> >> >> * we have the page locked, so new writeback can't start,
>> >> >> @@ -7394,17 +7452,21 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
>> >> >> btrfs_releasepage(page, GFP_NOFS);
>> >> >> return;
>> >> >> }
>> >> >> - lock_extent_bits(tree, page_start, page_end, 0, &cached_state);
>> >> >> - ordered = btrfs_lookup_ordered_extent(inode, page_offset(page));
>> >> >> +
>> >> >> + if (!inode_evicting)
>> >> >> + lock_extent_bits(tree, page_start, page_end, 0, &cached_state);
>> >> >> + ordered = btrfs_lookup_ordered_extent(inode, page_start);
>> >> >> if (ordered) {
>> >> >> /*
>> >> >> * IO on this page will never be started, so we need
>> >> >> * to account for any ordered extents now
>> >> >> */
>> >> >> - clear_extent_bit(tree, page_start, page_end,
>> >> >> - EXTENT_DIRTY | EXTENT_DELALLOC |
>> >> >> - EXTENT_LOCKED | EXTENT_DO_ACCOUNTING |
>> >> >> - EXTENT_DEFRAG, 1, 0, &cached_state, GFP_NOFS);
>> >> >> + if (!inode_evicting)
>> >> >> + clear_extent_bit(tree, page_start, page_end,
>> >> >> + EXTENT_DIRTY | EXTENT_DELALLOC |
>> >> >> + EXTENT_LOCKED | EXTENT_DO_ACCOUNTING |
>> >> >> + EXTENT_DEFRAG, 1, 0, &cached_state,
>> >> >> + GFP_NOFS);
>> >> >> /*
>> >> >> * whoever cleared the private bit is responsible
>> >> >> * for the finish_ordered_io
>> >> >> @@ -7428,14 +7490,22 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
>> >> >> btrfs_finish_ordered_io(ordered);
>> >> >> }
>> >> >> btrfs_put_ordered_extent(ordered);
>> >> >> - cached_state = NULL;
>> >> >> - lock_extent_bits(tree, page_start, page_end, 0, &cached_state);
>> >> >> + if (!inode_evicting) {
>> >> >> + cached_state = NULL;
>> >> >> + lock_extent_bits(tree, page_start, page_end, 0,
>> >> >> + &cached_state);
>> >> >> + }
>> >> >> + }
>> >> >> +
>> >> >> + if (!inode_evicting) {
>> >> >> + clear_extent_bit(tree, page_start, page_end,
>> >> >> + EXTENT_LOCKED | EXTENT_DIRTY |
>> >> >> + EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING |
>> >> >> + EXTENT_DEFRAG, 1, 1,
>> >> >> + &cached_state, GFP_NOFS);
>> >> >> +
>> >> >> + __btrfs_releasepage(page, GFP_NOFS);
>> >> >> }
>> >> >> - clear_extent_bit(tree, page_start, page_end,
>> >> >> - EXTENT_LOCKED | EXTENT_DIRTY | EXTENT_DELALLOC |
>> >> >> - EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG, 1, 1,
>> >> >> - &cached_state, GFP_NOFS);
>> >> >> - __btrfs_releasepage(page, GFP_NOFS);
>> >> >>
>> >> >> ClearPageChecked(page);
>> >> >> if (PagePrivate(page)) {
>> >> >> --
>> >> >> 1.7.9.5
>> >> >>
>> >> >> --
>> >> >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> >> >> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> >> >> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> >>
>> >>
>> >>
>> >> --
>> >> Filipe David Manana,
>> >>
>> >> "Reasonable men adapt themselves to the world.
>> >> Unreasonable men adapt the world to themselves.
>> >> That's why all progress depends on unreasonable men."
>>
>>
>>
>> --
>> Filipe David Manana,
>>
>> "Reasonable men adapt themselves to the world.
>> Unreasonable men adapt the world to themselves.
>> That's why all progress depends on unreasonable men."
--
Filipe David Manana,
"Reasonable men adapt themselves to the world.
Unreasonable men adapt the world to themselves.
That's why all progress depends on unreasonable men."
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html