On 17.04.20 г. 0:46 ч., Omar Sandoval wrote:
> From: Omar Sandoval <osandov@xxxxxx>
>
> In btrfs_submit_direct(), if we fail to allocate the btrfs_dio_private,
> we complete the ordered extent range. However, we don't mark that the
> range doesn't need to be cleaned up from btrfs_direct_IO() until later.
> Therefore, if we fail to allocate the btrfs_dio_private, we complete the
> ordered extent range twice. We could fix this by updating
> unsubmitted_oe_range earlier, but it's cleaner to reorganize the code so
> that creating the btrfs_dio_private and submitting the bios are
> separate, and once the btrfs_dio_private is created, cleanup always
> happens through the btrfs_dio_private.
>
> Fixes: f28a49287817 ("Btrfs: fix leaking of ordered extents after direct IO write error")
> Signed-off-by: Omar Sandoval <osandov@xxxxxx>
Generally looks ok, so :
Reviewed-by: Nikolay Borisov <nborisov@xxxxxxxx>, however please see
below for some remarks on the logic around unsubmitted_oe_range_start/end
> ---
> fs/btrfs/inode.c | 174 ++++++++++++++++++-----------------------------
> 1 file changed, 66 insertions(+), 108 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index b628c319a5b6..f6ce9749adb6 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7903,14 +7903,60 @@ static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio,
> return ret;
> }
>
> -static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
> +/*
> + * If this succeeds, the btrfs_dio_private is responsible for cleaning up locked
> + * or ordered extents whether or not we submit any bios.
> + */
> +static struct btrfs_dio_private *btrfs_create_dio_private(struct bio *dio_bio,
> + struct inode *inode,
> + loff_t file_offset)
> {
> - struct inode *inode = dip->inode;
> + const bool write = (bio_op(dio_bio) == REQ_OP_WRITE);
> + struct btrfs_dio_private *dip;
> + struct bio *bio;
> +
> + dip = kzalloc(sizeof(*dip), GFP_NOFS);
> + if (!dip)
> + return NULL;
> +
> + bio = btrfs_bio_clone(dio_bio);
> + bio->bi_private = dip;
> + btrfs_io_bio(bio)->logical = file_offset;
> +
> + dip->private = dio_bio->bi_private;
> + dip->inode = inode;
> + dip->logical_offset = file_offset;
> + dip->bytes = dio_bio->bi_iter.bi_size;
> + dip->disk_bytenr = (u64)dio_bio->bi_iter.bi_sector << 9;
> + dip->orig_bio = bio;
> + dip->dio_bio = dio_bio;
> + atomic_set(&dip->pending_bios, 1);
> +
> + if (write) {
> + struct btrfs_dio_data *dio_data = current->journal_info;
> +
> + dio_data->unsubmitted_oe_range_end = dip->logical_offset +
> + dip->bytes;
> + dio_data->unsubmitted_oe_range_start =
> + dio_data->unsubmitted_oe_range_end;
The logic around those 2 members is really subtle. We really have the
following:
1. btrfs_direct_IO sets those two to the same value.
2. When we call __blockdev_direct_IO unless
btrfs_get_blocks_direct->btrfs_get_blocks_direct_write is called to
modify unsubmitted_oe_range_start so that start < end. Cleanup won't
happen.
3. We come into btrfs_submit_direct - if it dip allocation fails we'd
return with oe_range_end now modified so cleanup will happen.
4. If we manage to allocate the dip we reset the unsubmitted range
members to be equal so that cleanup happens from btrfs_endio_direct_write.
This 4-step logic is not really obvious, especially given it's scattered
across 3 functions. Perhaps a comment saying that having the 2 members
being equal means cleanup in btrfs_DIRECT_io is disabled.
I'd rather have it spelled out in the changelog, I guess David can also
do that ?
> +
> + bio->bi_end_io = btrfs_endio_direct_write;
> + } else {
> + bio->bi_end_io = btrfs_endio_direct_read;
> + dip->subio_endio = btrfs_subio_endio_read;
> + }
> + return dip;
> +}
> +
<snip>