[PATCH 3/3] btrfs: describe the space reservation system in general

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Add another comment to cover how the space reservation system works
generally.  This covers the actual reservation flow, as well as how
flushing is handled.

Signed-off-by: Josef Bacik <josef@xxxxxxxxxxxxxx>
---
 fs/btrfs/space-info.c | 128 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 128 insertions(+)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index d3befc536a7f..6de1fbe2835a 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -10,6 +10,134 @@
 #include "transaction.h"
 #include "block-group.h"
 
+/*
+ * HOW DOES SPACE RESERVATION WORK
+ *
+ * If you want to know about delalloc specifically, there is a separate comment
+ * for that with the delalloc code.  This comment is about how the whole system
+ * works generally.
+ *
+ * BASIC CONCEPTS
+ *
+ *   1) space_info.  This is the ultimate arbiter of how much space we can use.
+ *   There's a description of the bytes_ fields with the struct declaration,
+ *   refer to that for specifics on each field.  Suffice it to say that for
+ *   reservations we care about total_bytes - SUM(space_info->bytes_) when
+ *   determining if there is space to make an allocation.
+ *
+ *   2) block_rsv's.  These are basically buckets for every different type of
+ *   metadata reservation we have.  You can see the comment in the block_rsv
+ *   code on the rules for each type, but generally block_rsv->reserved is how
+ *   much space is accounted for in space_info->bytes_may_use.
+ *
+ *   3) btrfs_calc*_size.  These are the worst case calculations we used based
+ *   on the number of items we will want to modify.  We have one for changing
+ *   items, and one for inserting new items.  Generally we use these helpers to
+ *   determine the size of the block reserves, and then use the actual bytes
+ *   values to adjust the space_info counters.
+ *
+ * MAKING RESERVATIONS, THE NORMAL CASE
+ *
+ *   Things wanting to make reservations will calculate the size that they want
+ *   and make a reservation request.  If there is sufficient space, and there
+ *   are no current reservations pending, we will adjust
+ *   space_info->bytes_may_use by this amount.
+ *
+ *   Once we allocate an extent, we will add that size to ->bytes_reserved and
+ *   subtract the size from ->bytes_may_use.  Once that extent is written out we
+ *   subtract that value from ->bytes_reserved and add it to ->bytes_used.
+ *
+ *   If there is an error at any point the reserver is responsible for dropping
+ *   its reservation from ->bytes_may_use.
+ *
+ * MAKING RESERVATIONS, FLUSHING
+ *
+ *   If we are unable to satisfy our reservation, or if there are pending
+ *   reservations already, we will create a reserve ticket and add ourselves to
+ *   the appropriate list.  This is controlled by btrfs_reserve_flush_enum.  For
+ *   simplicity sake this boils down to two cases, priority and normal.
+ *
+ *   1) Priority.  These reservations are important and have limited ability to
+ *   flush space.  For example, the relocation code currently tries to make a
+ *   reservation under a transaction commit, thus it cannot wait on anything
+ *   that may want to commit the transaction.  These tasks will add themselves
+ *   to the priority list and thus get any new space first, and then they can
+ *   flush space directly in their own context that is safe for them to do
+ *   without causing a deadlock.
+ *
+ *   2) Normal.  These reservations can wait forever on anything, because the do
+ *   not hold resources that they would deadlock on.  These tickets simply go to
+ *   sleep and start an async thread that will flush space on their behalf.
+ *   Every time one of the ->bytes_* counters is adjusted for the space info, we
+ *   will check to see if there is enough space to satisfy the requests (in
+ *   order) on either of our lists.  If there is enough space we will set the
+ *   ticket->bytes = 0, and wake the task up.  If we flush a few times and fail
+ *   to make any progress we will wake up all of the tickets and fail them all.
+ *
+ * THE FLUSHING STATES
+ *
+ *   Generally speaking we will have two cases for each state, a "nice" state
+ *   and a "ALL THE THINGS" state.  In btrfs we delay a lot of work in order to
+ *   reduce the locking over head on the various trees, and even to keep from
+ *   doing any work at all in the case of delayed refs.  Each of these delayed
+ *   things however hold reservations, and so letting them run allows us to
+ *   reclaim space so we can make new reservations.
+ *
+ *   FLUSH_DELAYED_ITEMS
+ *     Every inode has a delayed item to update the inode.  Take a simple write
+ *     for example, we would update the inode item at write time to update the
+ *     mtime, and then again at finish_ordered_io() time in order to update the
+ *     isize or bytes.  We keep these delayed items to coalesce these operations
+ *     into a single operation done on demand.  These are an easy way to reclaim
+ *     metadata space.
+ *
+ *   FLUSH_DELALLOC
+ *     Look at the delalloc comment to get an idea of how much space is reserved
+ *     for delayed allocation.  We can reclaim some of this space simply by
+ *     running delalloc, but usually we need to wait for ordered extents to
+ *     reclaim the bulk of this space.
+ *
+ *   FLUSH_DELAYED_REFS
+ *     We have a block reserve for the outstanding delayed refs space, and every
+ *     delayed ref operation holds a reservation.  Running these is a quick way
+ *     to reclaim space, but we want to hold this until the end because COW can
+ *     churn a lot and we can avoid making some extent tree modifications if we
+ *     are able to delay for as long as possible.
+ *
+ *   ALLOC_CHUNK
+ *     We will skip this the first time through space reservation, because of
+ *     overcommit and we don't want to have a lot of useless metadata space when
+ *     our worst case reservations will likely never come true.
+ *
+ *   RUN_DELAYED_IPUTS
+ *     If we're freeing inodes we're likely freeing checksums, file extent
+ *     items, and extent tree items.  Loads of space could be freed up by these
+ *     operations, however they won't be usable until the transaction commits.
+ *
+ *   COMMIT_TRANS
+ *     may_commit_transaction() is the ultimate arbiter on wether we commit the
+ *     transaction or not.  In order to avoid constantly churning we do all the
+ *     above flushing first and then commit the transaction as the last resort.
+ *     However we need to take into account things like pinned space that would
+ *     be freed, plus any delayed work we may not have gotten rid of in the case
+ *     of metadata.
+ *
+ * OVERCOMMIT
+ *   Because we hold so many reservations for metadata we will allow you to
+ *   reserve more space than is currently free in the currently allocate
+ *   metadata space.  This only happens with metadata, data does not allow
+ *   overcommitting.
+ *
+ *   You can see the current logic for when we allow overcommit in
+ *   btrfs_can_overcommit(), but it only applies to unallocated space.  If there
+ *   is no unallocated space to be had, all reservations are kept within the
+ *   free space in the allocated metadata chunks.
+ *
+ *   Because of overcommitting, you generally want to use the
+ *   btrfs_can_overcommit() logic for metadata allocations, as it does the right
+ *   thing with or without extra unallocated space.
+ */
+
 u64 __pure btrfs_space_info_used(struct btrfs_space_info *s_info,
 			  bool may_use_included)
 {
-- 
2.24.1




[Index of Archives]     [Linux Filesystem Development]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux