On Fri, Oct 25, 2019 at 04:59:56PM +0800, Qu Wenruo wrote:
> [BUG]
> When deleting large files (which cross block group boundary) with discard
> mount option, we find some btrfs_discard_extent() calls only trimmed part
> of its space, not the whole range:
>
> btrfs_discard_extent: type=0x1 start=19626196992 len=2144530432 trimmed=1073741824 ratio=50%
>
> type: bbio->map_type, in above case, it's SINGLE DATA.
> start: Logical address of this trim
> len: Logical length of this trim
> trimmed: Physically trimmed bytes
> ratio: trimmed / len
>
> Thus leading some unused space not discarded.
>
> [CAUSE]
> When discard mount option is specified, after a transaction is fully
> committed (super block written to disk), we begin to cleanup pinned
> extents in the following call chain:
>
> btrfs_commit_transaction()
> |- btrfs_finish_extent_commit()
> |- find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY);
> |- btrfs_discard_extent()
>
> However pinned extents are recorded in an extent_io_tree, which can
> merge adjacent extent states.
>
> When a large file get deleted and it has adjacent file extents across
> block group boundary, we will get a large merged range, like this:
>
> |<--- BG1 --->|<--- BG2 --->|
> |//////|<-- Range to discard --->|/////|
>
> To discard that range, we have the following calls:
> btrfs_discard_extent()
> |- btrfs_map_block()
> | Returned bbio will end at BG1's end. As btrfs_map_block()
> | never returns result across block group boundary.
> |- btrfs_issuse_discard()
> Issue discard for each stripe.
>
> So we will only discard the range in BG1, not the remaining part in BG2.
>
> Furthermore, this bug is not that reliably observed, for above case, if
> there is no other extent in BG2, BG2 will be empty and btrfs will trim
> all space of BG2, covering up the bug.
>
> [FIX]
> - Allow __btrfs_map_block_for_discard() to modify @length parameter
> btrfs_map_block() uses its @length paramter to notify the caller how
> many bytes are mapped in current call.
> With __btrfs_map_block_for_discard() also modifing the @length,
> btrfs_discard_extent() now understands when to do extra trim.
>
> - Call btrfs_map_block() in a loop until we hit the range end
> Since we now know how many bytes are mapped each time, we can iterate
> through each block group boundary and issue correct trim for each
> range.
>
> Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
> ---
> fs/btrfs/extent-tree.c | 41 +++++++++++++++++++++++++++++++----------
> fs/btrfs/volumes.c | 6 ++++--
> 2 files changed, 35 insertions(+), 12 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 49cb26fa7c63..ff2838bd677d 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -1306,8 +1306,10 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
> int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
> u64 num_bytes, u64 *actual_bytes)
> {
> - int ret;
> + int ret = 0;
> u64 discarded_bytes = 0;
> + u64 end = bytenr + num_bytes;
> + u64 cur = bytenr;
> struct btrfs_bio *bbio = NULL;
>
>
> @@ -1316,15 +1318,23 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
> * associated to its stripes that don't go away while we are discarding.
> */
> btrfs_bio_counter_inc_blocked(fs_info);
> - /* Tell the block device(s) that the sectors can be discarded */
> - ret = btrfs_map_block(fs_info, BTRFS_MAP_DISCARD, bytenr, &num_bytes,
> - &bbio, 0);
> - /* Error condition is -ENOMEM */
> - if (!ret) {
> - struct btrfs_bio_stripe *stripe = bbio->stripes;
> + while (cur < end) {
> + struct btrfs_bio_stripe *stripe;
> int i;
>
> + num_bytes = end - cur;
> + /* Tell the block device(s) that the sectors can be discarded */
> + ret = btrfs_map_block(fs_info, BTRFS_MAP_DISCARD, cur,
> + &num_bytes, &bbio, 0);
> + /*
> + * Error can be -ENOMEM, -ENOENT (no such chunk mapping) or
> + * -EOPNOTSUPP. For any such error, @num_bytes is not updated,
> + * thus we can't continue anyway.
> + */
> + if (ret < 0)
> + goto out;
>
> + stripe = bbio->stripes;
> for (i = 0; i < bbio->num_stripes; i++, stripe++) {
> u64 bytes;
> struct request_queue *req_q;
> @@ -1341,10 +1351,19 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
> stripe->physical,
> stripe->length,
> &bytes);
> - if (!ret)
> + if (!ret) {
> discarded_bytes += bytes;
> - else if (ret != -EOPNOTSUPP)
> - break; /* Logic errors or -ENOMEM, or -EIO but I don't know how that could happen JDM */
> + } else if (ret != -EOPNOTSUPP) {
> + /*
> + * Logic errors or -ENOMEM, or -EIO but I don't
> + * know how that could happen JDM
> + *
> + * Ans since there are two loops, explicitly
Hate for there to be a v5 at this point, but it should be "and". Anyway
Reviewed-by: Josef Bacik <josef@xxxxxxxxxxxxxx>
Thanks,
Josef