Re: [PATCH 24/24] btrfs: add a comment explaining the data flush steps

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 3.02.20 г. 22:49 ч., Josef Bacik wrote:
> The data flushing steps are not obvious to people other than myself and
> Chris.  Write a giant comment explaining the reasoning behind each flush
> step for data as well as why it is in that particular order.
> 
> Signed-off-by: Josef Bacik <josef@xxxxxxxxxxxxxx>
> ---
>  fs/btrfs/space-info.c | 46 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 46 insertions(+)
> 
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index 18a31d96bbbd..d3befc536a7f 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -780,6 +780,52 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
>  	} while (flush_state <= COMMIT_TRANS);
>  }
>  
> +/*
> + * FLUSH_DELALLOC_WAIT:
> + *   Space is free'd from flushing delalloc in one of two ways.
> + *
> + *   1) compression is on and we allocate less space than we reserved.
> + *   2) We are overwriting existing space.
> + *
> + *   For #1 that extra space is reclaimed as soon as the delalloc pages are
> + *   cow'ed, by way of btrfs_add_reserved_bytes() which adds the actual extent
> + *   length to ->bytes_reserved, and subtracts the reserved space from
> + *   ->bytes_may_use.
> + *
> + *   For #2 this is trickier.  Once the ordered extent runs we will drop the
> + *   extent in the range we are overwriting, which creates a delayed ref for
> + *   that freed extent.  This however is not reclaimed until the transaction
> + *   commits, thus the next stages.
> + *
> + * RUN_DELAYED_IPUTS
> + *   If we are freeing inodes, we want to make sure all delayed iputs have
> + *   completed, because they could have been on an inode with i_nlink == 0, and
> + *   thus have been trunated and free'd up space.  But again this space is not
> + *   immediately re-usable, it comes in the form of a delayed ref, which must be
> + *   run and then the transaction must be committed.
> + *
> + * FLUSH_DELAYED_REFS
> + *   The above two cases generate delayed refs that will affect
> + *   ->total_bytes_pinned.  However this counter can be inconsistent with
> + *   reality if there are outstanding delayed refs.  This is because we adjust
> + *   the counter based on the other delayed refs that exist.  So for example, if

IMO this sentence would be clearer if it simply says something along the
lines of " This is because we adjust the counter based solely on the
current set of delayed refs and disregard any on-disk state which might
include more refs".

> + *   we have an extent with 2 references, but we only drop 1, we'll see that
> + *   there is a negative delayed ref count for the extent and assume that the
> + *   space will be free'd, and thus increase ->total_bytes_pinned.
> + *
> + *   Running the delayed refs gives us the actual real view of what will be
> + *   freed at the transaction commit time.  This stage will not actually free
> + *   space for us, it just makes sure that may_commit_transaction() has all of
> + *   the information it needs to make the right decision.

Is there any particular reason why total_bytes_pinned is sort of double
accounted. I.e first add_pinned_bytes is called when a DROP del ref is
added with negative ref. Then during processing of that delref
__btrfs_free_extent either:

a) Removes the ref but doesn't free the extent if there were other,
non-del refs for this extent

b) Removes the extent and calls btrfs_update_block_group to account it
again as pinned (this time also setting the respective ranges as pinned)

This double accounting doesn't really happen because after the
processing of the DROP del ref is finished
cleanup_ref_head->btrfs_cleanup_ref_head_accounting will actually clean
the pinned bytes added at creation time of the DROP del ref. Can we
avoid this and simply rely on the update of total_bytes_pinned when an
extent is freed.


> + *
> + * COMMIT_TRANS
> + *   This is where we reclaim all of the pinned space generated by the previous
> + *   two stages.  We will not commit the transaction if we don't think we're
> + *   likely to satisfy our request, which means if our current free space +
> + *   total_bytes_pinned < reservation we will not commit.  This is why the
> + *   previous states are actually important, to make sure we know for sure
> + *   wether committing the transaction will allow us to make progress.

typo: s/wether/whether/

> + */
>  static const enum btrfs_flush_state data_flush_states[] = {
>  	FLUSH_DELALLOC_WAIT,
>  	RUN_DELAYED_IPUTS,
> 

This looks good and helped me understand the machinery and put some
context into the code.



[Index of Archives]     [Linux Filesystem Development]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux