On Mon, Oct 07, 2019 at 06:38:10PM -0400, Dennis Zhou wrote:
> On Mon, Oct 07, 2019 at 04:37:28PM -0400, Josef Bacik wrote:
> > On Mon, Oct 07, 2019 at 04:17:34PM -0400, Dennis Zhou wrote:
> > > Async discard will use the free space cache as backing knowledge for
> > > which extents to discard. This patch plumbs knowledge about which
> > > extents need to be discarded into the free space cache from
> > > unpin_extent_range().
> > >
> > > An untrimmed extent can merge with everything as this is a new region.
> > > Absorbing trimmed extents is a tradeoff to for greater coalescing which
> > > makes life better for find_free_extent(). Additionally, it seems the
> > > size of a trim isn't as problematic as the trim io itself.
> > >
> > > When reading in the free space cache from disk, if sync is set, mark all
> > > extents as trimmed. The current code ensures at transaction commit that
> > > all free space is trimmed when sync is set, so this reflects that.
> > >
> > > Signed-off-by: Dennis Zhou <dennis@xxxxxxxxxx>
> > > ---
> > > fs/btrfs/extent-tree.c | 15 ++++++++++-----
> > > fs/btrfs/free-space-cache.c | 38 ++++++++++++++++++++++++++++++-------
> > > fs/btrfs/free-space-cache.h | 10 +++++++++-
> > > fs/btrfs/inode-map.c | 13 +++++++------
> > > 4 files changed, 57 insertions(+), 19 deletions(-)
> > >
> > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > > index 77a5904756c5..b9e3bedad878 100644
> > > --- a/fs/btrfs/extent-tree.c
> > > +++ b/fs/btrfs/extent-tree.c
> > > @@ -2782,7 +2782,7 @@ fetch_cluster_info(struct btrfs_fs_info *fs_info,
> > > }
> > >
> > > static int unpin_extent_range(struct btrfs_fs_info *fs_info,
> > > - u64 start, u64 end,
> > > + u64 start, u64 end, u32 fsc_flags,
> > > const bool return_free_space)
> > > {
> > > struct btrfs_block_group_cache *cache = NULL;
> > > @@ -2816,7 +2816,9 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info,
> > > if (start < cache->last_byte_to_unpin) {
> > > len = min(len, cache->last_byte_to_unpin - start);
> > > if (return_free_space)
> > > - btrfs_add_free_space(cache, start, len);
> > > + __btrfs_add_free_space(fs_info,
> > > + cache->free_space_ctl,
> > > + start, len, fsc_flags);
> > > }
> > >
> > > start += len;
> > > @@ -2894,6 +2896,7 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
> > >
> > > while (!trans->aborted) {
> > > struct extent_state *cached_state = NULL;
> > > + u32 fsc_flags = 0;
> > >
> > > mutex_lock(&fs_info->unused_bg_unpin_mutex);
> > > ret = find_first_extent_bit(unpin, 0, &start, &end,
> > > @@ -2903,12 +2906,14 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
> > > break;
> > > }
> > >
> > > - if (btrfs_test_opt(fs_info, DISCARD_SYNC))
> > > + if (btrfs_test_opt(fs_info, DISCARD_SYNC)) {
> > > ret = btrfs_discard_extent(fs_info, start,
> > > end + 1 - start, NULL);
> > > + fsc_flags |= BTRFS_FSC_TRIMMED;
> > > + }
> > >
> > > clear_extent_dirty(unpin, start, end, &cached_state);
> > > - unpin_extent_range(fs_info, start, end, true);
> > > + unpin_extent_range(fs_info, start, end, fsc_flags, true);
> > > mutex_unlock(&fs_info->unused_bg_unpin_mutex);
> > > free_extent_state(cached_state);
> > > cond_resched();
> > > @@ -5512,7 +5517,7 @@ u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo)
> > > int btrfs_error_unpin_extent_range(struct btrfs_fs_info *fs_info,
> > > u64 start, u64 end)
> > > {
> > > - return unpin_extent_range(fs_info, start, end, false);
> > > + return unpin_extent_range(fs_info, start, end, 0, false);
> > > }
> > >
> > > /*
> > > diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> > > index d54dcd0ab230..f119895292b8 100644
> > > --- a/fs/btrfs/free-space-cache.c
> > > +++ b/fs/btrfs/free-space-cache.c
> > > @@ -747,6 +747,14 @@ static int __load_free_space_cache(struct btrfs_root *root, struct inode *inode,
> > > goto free_cache;
> > > }
> > >
> > > + /*
> > > + * Sync discard ensures that the free space cache is always
> > > + * trimmed. So when reading this in, the state should reflect
> > > + * that.
> > > + */
> > > + if (btrfs_test_opt(fs_info, DISCARD_SYNC))
> > > + e->flags |= BTRFS_FSC_TRIMMED;
> > > +
> > > if (!e->bytes) {
> > > kmem_cache_free(btrfs_free_space_cachep, e);
> > > goto free_cache;
> > > @@ -2165,6 +2173,7 @@ static bool try_merge_free_space(struct btrfs_free_space_ctl *ctl,
> > > bool merged = false;
> > > u64 offset = info->offset;
> > > u64 bytes = info->bytes;
> > > + bool is_trimmed = btrfs_free_space_trimmed(info);
> > >
> > > /*
> > > * first we want to see if there is free space adjacent to the range we
> > > @@ -2178,7 +2187,8 @@ static bool try_merge_free_space(struct btrfs_free_space_ctl *ctl,
> > > else
> > > left_info = tree_search_offset(ctl, offset - 1, 0, 0);
> > >
> > > - if (right_info && !right_info->bitmap) {
> > > + if (right_info && !right_info->bitmap &&
> > > + (!is_trimmed || btrfs_free_space_trimmed(right_info))) {
> > > if (update_stat)
> > > unlink_free_space(ctl, right_info);
> > > else
> > > @@ -2189,7 +2199,8 @@ static bool try_merge_free_space(struct btrfs_free_space_ctl *ctl,
> > > }
> > >
> > > if (left_info && !left_info->bitmap &&
> > > - left_info->offset + left_info->bytes == offset) {
> > > + left_info->offset + left_info->bytes == offset &&
> > > + (!is_trimmed || btrfs_free_space_trimmed(left_info))) {
> >
> > So we allow merging if we haven't trimmed this entry, or if the adjacent entry
> > is already trimmed? This means we'll merge if we trimmed the new entry
> > regardless of the adjacent entries status, or if the new entry is drity. Why is
> > that? Thanks,
> >
>
> This is the tradeoff I called out above here:
>
> > > Absorbing trimmed extents is a tradeoff to for greater coalescing which
> > > makes life better for find_free_extent(). Additionally, it seems the
> > > size of a trim isn't as problematic as the trim io itself.
>
> A problematic example case:
>
> |----trimmed----|/////X/////|-----trimmed-----|
>
> If region X gets freed and returned to the free space cache, we end up
> with the following:
>
> |----trimmed----|-untrimmed-|-----trimmed-----|
>
> This isn't great because now we need to teach find_free_extent() to span
> multiple btrfs_free_space entries, something I didn't want to do. So the
> other option is to overtrim trading for a simpler find_free_extent().
> Then the above becomes:
>
> |-------------------trimmed-------------------|
>
> It makes the assumption that if we're inserting, it's generally is free
> space being returned rather than we needed to slice out from the middle
> of a block. It does still have degenerative cases, but it's better than
> the above. The merging also allows for stuff to come out of bitmaps more
> proactively too.
>
> Also from what it seems, the cost of a discard operation is quite costly
> relative to the amount your discarding (1 larger discard is better than
> several smaller discards) as it will clog up the device too.
OOOOOh I fucking get it now. That's going to need a comment, because it's not
obvious at all.
However I still wonder if this is right. Your above examples are legitimate,
but say you have
| 512mib adding back that isn't trimmed |------- 512mib trimmed ------|
we'll merge these two, but really we should probably trim that 512mib chunk
we're adding right? Thanks,
Josef