On 2/4/20 4:47 AM, Nikolay Borisov wrote:
On 3.02.20 г. 22:49 ч., Josef Bacik wrote:
The data flushing steps are not obvious to people other than myself and
Chris. Write a giant comment explaining the reasoning behind each flush
step for data as well as why it is in that particular order.
Signed-off-by: Josef Bacik <josef@xxxxxxxxxxxxxx>
---
fs/btrfs/space-info.c | 46 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 46 insertions(+)
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 18a31d96bbbd..d3befc536a7f 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -780,6 +780,52 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
} while (flush_state <= COMMIT_TRANS);
}
+/*
+ * FLUSH_DELALLOC_WAIT:
+ * Space is free'd from flushing delalloc in one of two ways.
+ *
+ * 1) compression is on and we allocate less space than we reserved.
+ * 2) We are overwriting existing space.
+ *
+ * For #1 that extra space is reclaimed as soon as the delalloc pages are
+ * cow'ed, by way of btrfs_add_reserved_bytes() which adds the actual extent
+ * length to ->bytes_reserved, and subtracts the reserved space from
+ * ->bytes_may_use.
+ *
+ * For #2 this is trickier. Once the ordered extent runs we will drop the
+ * extent in the range we are overwriting, which creates a delayed ref for
+ * that freed extent. This however is not reclaimed until the transaction
+ * commits, thus the next stages.
+ *
+ * RUN_DELAYED_IPUTS
+ * If we are freeing inodes, we want to make sure all delayed iputs have
+ * completed, because they could have been on an inode with i_nlink == 0, and
+ * thus have been trunated and free'd up space. But again this space is not
+ * immediately re-usable, it comes in the form of a delayed ref, which must be
+ * run and then the transaction must be committed.
+ *
+ * FLUSH_DELAYED_REFS
+ * The above two cases generate delayed refs that will affect
+ * ->total_bytes_pinned. However this counter can be inconsistent with
+ * reality if there are outstanding delayed refs. This is because we adjust
+ * the counter based on the other delayed refs that exist. So for example, if
IMO this sentence would be clearer if it simply says something along the
lines of " This is because we adjust the counter based solely on the
current set of delayed refs and disregard any on-disk state which might
include more refs".
+ * we have an extent with 2 references, but we only drop 1, we'll see that
+ * there is a negative delayed ref count for the extent and assume that the
+ * space will be free'd, and thus increase ->total_bytes_pinned.
+ *
+ * Running the delayed refs gives us the actual real view of what will be
+ * freed at the transaction commit time. This stage will not actually free
+ * space for us, it just makes sure that may_commit_transaction() has all of
+ * the information it needs to make the right decision.
Is there any particular reason why total_bytes_pinned is sort of double
accounted. I.e first add_pinned_bytes is called when a DROP del ref is
added with negative ref. Then during processing of that delref
__btrfs_free_extent either:
a) Removes the ref but doesn't free the extent if there were other,
non-del refs for this extent
b) Removes the extent and calls btrfs_update_block_group to account it
again as pinned (this time also setting the respective ranges as pinned)
This double accounting doesn't really happen because after the
processing of the DROP del ref is finished
cleanup_ref_head->btrfs_cleanup_ref_head_accounting will actually clean
the pinned bytes added at creation time of the DROP del ref. Can we
avoid this and simply rely on the update of total_bytes_pinned when an
extent is freed.
We discussed this offline, just replying here for those playing along at home.
Now that everything is unified we no longer need this extra total_bytes_pinned
accounting. Before this is what we were relying on in the data path to account
for any data that may be freed once delayed refs run, as we didn't run delayed
refs in the data reservation path.
Now that we are in fact doing this, we do not need this extra accounting. In
fact at the point we check in may_commit_transaction() our
space_info->bytes_pinned will be uptodate, and thus we don't need this percpu
counter at all. Nikolay is going to follow up after these patches are merged
and remove the counter altogether. Thanks,
Josef