[PATCH 2/3] btrfs: add a comment describing delalloc space reservation

Delalloc space reservation is tricky because it encompasses both data
and metadata.  Make it clear what each side does, the general flow of
how space is moved throughout the lifetime of a write, and what goes
into the calculations.

Signed-off-by: Josef Bacik <josef@xxxxxxxxxxxxxx>
---
 fs/btrfs/delalloc-space.c | 90 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 90 insertions(+)
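
For readers who want the data side of the comment below in concrete form, here is a minimal userspace sketch of how the counters move. It is not the kernel code: every `toy_` name is hypothetical, `-1` stands in for `-ENOSPC`, and chunk allocation and flushing are left out entirely. It only models bytes going into `bytes_may_use` at reservation time and then either moving to `bytes_reserved` on allocation or being freed on error.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the data space_info counters. */
struct toy_space_info {
	uint64_t total_bytes;
	uint64_t bytes_used;
	uint64_t bytes_reserved;
	uint64_t bytes_may_use;
};

/* Reserve @bytes of data space; returns 0 on success, -1 if it cannot fit. */
static inline int toy_reserve_data(struct toy_space_info *si, uint64_t bytes)
{
	uint64_t committed = si->bytes_used + si->bytes_reserved +
			     si->bytes_may_use;

	/* The real code would try chunk allocation and flushing here. */
	if (committed > si->total_bytes || bytes > si->total_bytes - committed)
		return -1;

	si->bytes_may_use += bytes;
	return 0;
}

/*
 * Path 1: the extent is allocated.  @num_bytes is the on-disk extent size and
 * @ram_bytes is the range it covers; they differ for compressed extents.
 */
static inline void toy_add_reserved_bytes(struct toy_space_info *si,
					  uint64_t ram_bytes,
					  uint64_t num_bytes)
{
	assert(si->bytes_may_use >= ram_bytes);
	si->bytes_reserved += num_bytes;
	si->bytes_may_use -= ram_bytes;
}

/* Path 2: an error occurred, so the unused reservation is handed back. */
static inline void toy_free_reserved_data(struct toy_space_info *si,
					  uint64_t bytes)
{
	assert(si->bytes_may_use >= bytes);
	si->bytes_may_use -= bytes;
}
```

In this model, an 8k reservation that compresses into a 4k extent would be recorded as `toy_add_reserved_bytes(&si, 8192, 4096)`, leaving `bytes_may_use` at 0 and `bytes_reserved` at 4096.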

diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
index c13d8609cc99..09a9c01fc1b5 100644
--- a/fs/btrfs/delalloc-space.c
+++ b/fs/btrfs/delalloc-space.c
@@ -9,6 +9,96 @@
 #include "qgroup.h"
 #include "block-group.h"
 
+/*
+ * HOW DOES THIS WORK
+ *
+ * There are two stages to data reservations, one for data and one for metadata
+ * to handle the new extents and checksums generated by writing data.
+ *
+ *
+ * DATA RESERVATION
+ *   The data reservation stuff is relatively straightforward.  We want X bytes,
+ *   and thus need to make sure we have X bytes free in data space in order to
+ *   write that data.  If there are not X bytes free, allocate data chunks until
+ *   we can satisfy that reservation.  If we can no longer allocate data chunks,
+ *   attempt to flush space to see if we can now make the reservation.  See the
+ *   comment for data_flush_states to see how that flushing is accomplished.
+ *
+ *   Once this space is reserved, it is added to space_info->bytes_may_use.  The
+ *   caller must keep track of this reservation and free it up if it is never
+ *   used.  With the buffered IO case this is handled via the EXTENT_DELALLOC
+ *   bits on the inode's io_tree.  For direct IO it's more straightforward: we
+ *   take the reservation at the start of the operation, and if we write less
+ *   than we reserved we free the excess.
+ *
+ *   For the buffered case our reservation will take one of two paths:
+ *
+ *   1) It is allocated.  In find_free_extent() we will call
+ *   btrfs_add_reserved_bytes() with the size of the extent we made, along with
+ *   the size that we are covering with this allocation.  For non-compressed
+ *   these will be the same thing, but for compressed they could be different.
+ *   In any case, we increase space_info->bytes_reserved by the extent size, and
+ *   reduce the space_info->bytes_may_use by the ram_bytes size.  From now on
+ *   the handling of this reserved space is the responsibility of the ordered
+ *   extent or the cow path.
+ *
+ *   2) There is an error, and we free it.  This is handled with the
+ *   EXTENT_CLEAR_DATA_RESV bit when clearing EXTENT_DELALLOC on the inode's
+ *   io_tree.
+ *
+ * METADATA RESERVATION
+ *   The general metadata reservation lifetimes are discussed elsewhere; this
+ *   will just focus on how it is used for delalloc space.
+ *
+ *   There are 3 things we are keeping reservations for.
+ *
+ *   1) Updating the inode item.  We hold a reservation for this inode as long
+ *   as there are dirty bytes outstanding for this inode.  This is because we
+ *   may update the inode multiple times throughout an operation, and there is
+ *   no telling when we may have to do a full cow back to that inode item.  Thus
+ *   we must always hold a reservation.
+ *
+ *   2) Adding an extent item.  This is trickier, so a few sub points:
+ *
+ *     a) We keep track of how many extents an inode may need to create in
+ *     inode->outstanding_extents.  This is how many items we will have reserved
+ *     for the extents for this inode.
+ *
+ *     b) count_max_extents() is used to figure out how many extent items we
+ *     will need based on the contiguous area we have dirtied.  Thus if we are
+ *     writing 4k extents but they coalesce into a very large extent, we will
+ *     break this into smaller extents which means we'll need a reservation for
+ *     each of those extents.
+ *
+ *     c) When we set EXTENT_DELALLOC on the inode io_tree we will figure out
+ *     the number of extents needed for the contiguous area we just created,
+ *     and add that to inode->outstanding_extents.
+ *
+ *     d) We have no idea at reservation time how this new extent fits into
+ *     existing extents.  We unconditionally use count_max_extents() on the
+ *     reservation we are currently doing.  The reservation _must_ use
+ *     btrfs_delalloc_release_extents() once it has done its work to clear up
+ *     these outstanding extents.  This means that we will transiently have more
+ *     extent reservations for this inode than we need.  For example, say we
+ *     have a clean inode, and we do a buffered write of 4k.  The reservation
+ *     code will bump outstanding_extents to 1, and then set_delalloc will
+ *     increase it to 2.  Then once we are finished,
+ *     btrfs_delalloc_release_extents() will drop it back down to 1 again.
+ *
+ *     e) Ordered extents take on the responsibility of their extent.  We know
+ *     that the ordered extent represents a single inode item, so it will modify
+ *     ->outstanding_extents by 1, and will clear delalloc which will adjust the
+ *     ->outstanding_extents by whatever value it needs to be adjusted to.  Once
+ *     the ordered io is finished we drop the ->outstanding_extents by 1 and if
+ *     we are 0 we drop our inode item reservation as well.
+ *
+ *   3) Adding csums for the range.  This is more straightforward than the
+ *   extent items, as we just want to hold the number of bytes we'll need for
+ *   checksums until the ordered extent is removed.  If there is an error it is
+ *   cleared via the EXTENT_CLEAR_META_RESV bit when clearing EXTENT_DELALLOC
+ *   on the inode io_tree.
+ */
+
 int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes)
 {
 	struct btrfs_root *root = inode->root;
-- 
2.24.1
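
The 4k buffered-write example in point 2d of the comment can be traced with a small userspace sketch. It is not the kernel implementation: the `toy_` names are hypothetical, and the 128 MiB cap per extent is an assumption that should be checked against `BTRFS_MAX_EXTENT_SIZE` in the tree. It only shows the count going 0, 1, 2 and back to 1.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: one extent item covers at most 128 MiB. */
#define TOY_MAX_EXTENT_SIZE (128ULL * 1024 * 1024)

/* Hypothetical stand-in for the per-inode counter. */
struct toy_inode {
	uint32_t outstanding_extents;
};

/* Round up: how many extent items a contiguous range of @size may need. */
static inline uint32_t toy_count_max_extents(uint64_t size)
{
	return (uint32_t)((size + TOY_MAX_EXTENT_SIZE - 1) /
			  TOY_MAX_EXTENT_SIZE);
}

/* Reservation time: unconditionally account for the range being reserved. */
static inline void toy_reserve_metadata(struct toy_inode *inode, uint64_t len)
{
	inode->outstanding_extents += toy_count_max_extents(len);
}

/* EXTENT_DELALLOC is set: account for the contiguous area just created. */
static inline void toy_set_delalloc(struct toy_inode *inode, uint64_t len)
{
	inode->outstanding_extents += toy_count_max_extents(len);
}

/* The reserver is done: drop the transient count taken at reservation time. */
static inline void toy_release_extents(struct toy_inode *inode, uint64_t len)
{
	uint32_t nr = toy_count_max_extents(len);

	assert(inode->outstanding_extents >= nr);
	inode->outstanding_extents -= nr;
}
```

Calling the three helpers in order with a length of 4096 on a zeroed `struct toy_inode` reproduces the sequence in the comment. The sketch ignores merging with neighbouring delalloc ranges, which the real accounting has to handle.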



