On 2020/2/4 上午4:44, Josef Bacik wrote:
> delalloc space reservation is tricky because it encompasses both data
> and metadata. Make it clear what each side does, the general flow of
> how space is moved throughout the lifetime of a write, and what goes
> into the calculations.
In fact, the lifespan of a write would be super helpful for newbies.
I still remember the pain when trying to understand the whole mechanism
years ago.
>
> Signed-off-by: Josef Bacik <josef@xxxxxxxxxxxxxx>
> ---
> fs/btrfs/delalloc-space.c | 90 +++++++++++++++++++++++++++++++++++++++
> 1 file changed, 90 insertions(+)
>
> diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
> index c13d8609cc99..09a9c01fc1b5 100644
> --- a/fs/btrfs/delalloc-space.c
> +++ b/fs/btrfs/delalloc-space.c
> @@ -9,6 +9,96 @@
> #include "qgroup.h"
> #include "block-group.h"
>
> +/*
> + * HOW DOES THIS WORK
> + *
> + * There are two stages to data reservations, one for data and one for metadata
> + * to handle the new extents and checksums generated by writing data.
> + *
> + *
> + * DATA RESERVATION
> + * The data reservation stuff is relatively straightforward. We want X bytes,
> + * and thus need to make sure we have X bytes free in data space in order to
> + * write that data. If there is not X bytes free, allocate data chunks until
> + * we can satisfy that reservation. If we can no longer allocate data chunks,
> + * attempt to flush space to see if we can now make the reservaiton. See the
> + * comment for data_flush_states to see how that flushing is accomplished.
What about such less words version?
We want X bytes of data space.
If there is not enough, try the following methods in order:
- Allocate new data chunk
- Flush space
See comment for data_flush_states
> + *
> + * Once this space is reserved, it is added to space_info->bytes_may_use. The
> + * caller must keep track of this reservation and free it up if it is never
> + * used. With the buffered IO case this is handled via the EXTENT_DELALLOC
> + * bit's on the inode's io_tree. For direct IO it's more straightforward, we
> + * take the reservation at the start of the operation, and if we write less
> + * than we reserved we free the excess.
This part involves the lifespan and state machine of data.
I guess more explanation on the state machine would help a lot.
Like:
Page clean
|
+- btrfs_buffered_write()
| Reserve data space for data, metadata space for csum/file
| extents/inodes.
|
Page dirty
|
+- run_delalloc_range()
| Allocate data extents, submit ordered extents to do csum calculation
| and bio submission
Page write back
|
+- finish_oredred_io()
| Insert csum and file extents
|
Page clean
Although I'm not sure if such lifespan should belong to delalloc-space.c.
> + *
> + * For the buffered case our reservation will take one of two paths
> + *
> + * 1) It is allocated. In find_free_extent() we will call
> + * btrfs_add_reserved_bytes() with the size of the extent we made, along with
> + * the size that we are covering with this allocation. For non-compressed
> + * these will be the same thing, but for compressed they could be different.
> + * In any case, we increase space_info->bytes_reserved by the extent size, and
> + * reduce the space_info->bytes_may_use by the ram_bytes size. From now on
> + * the handling of this reserved space is the responsibility of the ordered
> + * extent or the cow path.
> + *
> + * 2) There is an error, and we free it. This is handled with the
> + * EXTENT_CLEAR_DATA_RESV bit when clearing EXTENT_DELALLOC on the inode's
> + * io_tree.
> + *
> + * METADATA RESERVATION
> + * The general metadata reservation lifetimes are discussed elsewhere, this
> + * will just focus on how it is used for delalloc space.
> + *
> + * There are 3 things we are keeping reservations for.
It looks the 3 things are too detailed I guess?
It's definitely educational, but not sure if it fits the introduction
nature of such comment.
I guess we only need to mention:
- Objective
How this meta rsv is used for (inode item, file extents, csum)
- Location of interest
Important details. (outstanding extents and DELALLOC bits for metadata
rsv calculation)
- Timing of such rsv
When we reserve/update, use and release. (function entrance)
Then it should be enough for people to dig for their own interest.
Thanks,
Qu
> + *
> + * 1) Updating the inode item. We hold a reservation for this inode as long
> + * as there are dirty bytes outstanding for this inode. This is because we
> + * may update the inode multiple times throughout an operation, and there is
> + * no telling when we may have to do a full cow back to that inode item. Thus
> + * we must always hold a reservation.
> + *
> + * 2) Adding an extent item. This is trickier, so a few sub points
> + *
> + * a) We keep track of how many extents an inode may need to create in
> + * inode->outstanding_extents. This is how many items we will have reserved
> + * for the extents for this inode.
> + *
> + * b) count_max_extents() is used to figure out how many extent items we
> + * will need based on the contiguous area we have dirtied. Thus if we are
> + * writing 4k extents but they coalesce into a very large extent, we will
> + * break this into smaller extents which means we'll need a reservation for
> + * each of those extents.
> + *
> + * c) When we set EXTENT_DELALLOC on the inode io_tree we will figure out
> + * the nummber of extents needed for the contiguous area we just created,
> + * and add that to inode->outstanding_extents.
> + *
> + * d) We have no idea at reservation time how this new extent fits into
> + * existing extents. We unconditionally use count_max_extents() on the
> + * reservation we are currently doing. The reservation _must_ use
> + * btrfs_delalloc_release_extents() once it has done it's work to clear up
> + * this outstanding extents. This means that we will transiently have too
> + * many extent reservations for this inode than we need. For example say we
> + * have a clean inode, and we do a buffered write of 4k. The reservation
> + * code will mod outstanding_extents to 1, and then set_delalloc will
> + * increase it to 2. Then once we are finished,
> + * btrfs_delalloc_release_extents() will drop it back down to 1 again.
> + *
> + * e) Ordered extents take on the responsibility of their extent. We know
> + * that the ordered extent represents a single inode item, so it will modify
> + * ->outstanding_extents by 1, and will clear delalloc which will adjust the
> + * ->outstanding_extents by whatever value it needs to be adjusted to. Once
> + * the ordered io is finished we drop the ->outstanding_extents by 1 and if
> + * we are 0 we drop our inode item reservation as well.
> + *
> + * 3) Adding csums for the range. This is more straightforward than the
> + * extent items, as we just want to hold the number of bytes we'll need for
> + * checksums until the ordered extent is removed. If there is an error it is
> + * cleared via the EXTENT_CLEAR_META_RESV bit when clearning EXTENT_DELALLOC
> + * on the inode io_tree.
> + */
> +
> int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes)
> {
> struct btrfs_root *root = inode->root;
>
Attachment:
signature.asc
Description: OpenPGP digital signature
