Re: [PATCH v8 1/5] btrfs: Introduce per-profile available space facility

On 21.02.20 at 8:11, Qu Wenruo wrote:
> [PROBLEM]
> There are some locations in btrfs requiring accurate estimation on how
> many new bytes can be allocated on unallocated space.
> 
> We have two types of estimation:
> - Factor-based calculation
>   Take all unallocated space and divide it by the profile factor.
>   One obvious user is can_overcommit().
> 
> - Chunk-allocator-like calculation
>   This emulates the chunk allocator's behavior to get a proper
>   estimation.
>   The only user is btrfs_calc_avail_data_space(), utilized by
>   btrfs_statfs().
>   The problem is that the function is not general-purpose enough, and
>   can't handle things like RAID5/6.
> 
> The current factor-based calculation can't handle the following case:
>   devid 1 unallocated:	1T
>   devid 2 unallocated:	10T
>   metadata type:	RAID1
> 
> Using a factor, we would estimate (1T + 10T) / 2 = 5.5T of free space
> for metadata.
> But in fact we can only get 1T of free space, as RAID1 is limited by
> the smallest device.
> 
> [SOLUTION]
> This patch will introduce per-profile available space calculation,
> which can give an estimation based on chunk-allocator-like behavior.
> 
> The difference between it and the chunk allocator is mostly in rounding
> and the [0, 1M) reserved-space handling, which shouldn't have any
> practical impact.
> 
> The newly introduced facility calculates the available space for each
> profile type using such chunk-allocator-like logic.
> 
> With it, for the above device layout we get the full available space
> array:
>   RAID10:	0  (not enough devices)
>   RAID1:	1T
>   RAID1C3:	0  (not enough devices)
>   RAID1C4:	0  (not enough devices)
>   DUP:		5.5T
>   RAID0:	2T
>   SINGLE:	11T
>   RAID5:	1T
>   RAID6:	0  (not enough devices)
> 
> Or for a more complex example:
>   devid 1 unallocated:	1T
>   devid 2 unallocated:  1T
>   devid 3 unallocated:	10T
> 
> We will get an array of:
>   RAID10:	0  (not enough devices)
>   RAID1:	2T
>   RAID1C3:	1T
>   RAID1C4:	0  (not enough devices)
>   DUP:		6T
>   RAID0:	3T
>   SINGLE:	12T
>   RAID5:	2T
>   RAID6:	0  (not enough devices)
> 
> For each profile, we do a chunk-allocator-level calculation.
> The pseudo code looks like:
> 
>   clear_virtual_used_space_of_all_rw_devices();
>   do {
>   	/*
>   	 * The same as the chunk allocator, except that besides real
>   	 * used space we also take virtual used space into account.
>   	 */
>   	sort_device_with_virtual_free_space();
> 
>   	/*
>   	 * Unlike the chunk allocator, we don't need to bother with
>   	 * hole/stripe size, so we use the smallest device to make
>   	 * sure we can allocate as many stripes as the regular chunk
>   	 * allocator would.
>   	 */
>   	stripe_size = device_with_smallest_free->avail_space;
>   	stripe_size = min(stripe_size, to_alloc / ndevs);
> 
>   	/*
>   	 * Allocate a virtual chunk; the allocated virtual chunk
>   	 * increases virtual used space, allowing the next iteration
>   	 * to properly emulate chunk allocator behavior.
>   	 */
>   	ret = alloc_virtual_chunk(stripe_size, &allocated_size);
>   	if (ret == 0)
>   		avail += allocated_size;
>   } while (ret == 0);
> 
> As the stripe size is always bounded by the selected device with the
> least free space, the devices with the most space are consumed first,
> just like in the chunk allocator.
> For the above 1T + 10T layout, we allocate a 1T virtual chunk in the
> first iteration, then run out of devices in the next one.
> 
> Thus we only get 1T of free space for the RAID1 type, just as the
> chunk allocator would.
> 
> The patch updates the per-profile available space at the following
> times:
> - Mount time
> - Chunk allocation
> - Chunk removal
> - Device grow
> - Device shrink
> - Device add
> - Device removal
> 
> All of these are protected by chunk_mutex, and we only iterate
> in-memory structures without triggering any extra IO, so the
> performance impact should be pretty small.
> 
> For the extra error handling, the principle is to keep the old
> behavior: if the old error handler would just return an error, we
> follow it, no matter whether the caller reverts the device size.
> 
> Proper error handling will be added in later patches, as I don't want
> to bloat the core facility with error handling code, especially since
> some situations need quite a bit of new code to handle errors.
> 
> Suggested-by: Josef Bacik <josef@xxxxxxxxxxxxxx>
> Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>

Apart from one minor nit below:

Reviewed-by: Nikolay Borisov <nborisov@xxxxxxxx>

> ---
>  fs/btrfs/volumes.c | 217 ++++++++++++++++++++++++++++++++++++++++-----
>  fs/btrfs/volumes.h |  10 +++
>  2 files changed, 207 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 9cfc668f91f4..e04e3f6e55f4 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1949,6 +1949,171 @@ static u64 btrfs_num_devices(struct btrfs_fs_info *fs_info)
>  	return num_devices;
>  }
>  
<snip>

> +
> +static int calc_one_profile_avail(struct btrfs_fs_info *fs_info,
> +				  enum btrfs_raid_types type,
> +				  u64 *result_ret)
> +{
> +	struct btrfs_device_info *devices_info = NULL;
> +	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> +	struct btrfs_device *device;
> +	u64 allocated;
> +	u64 result = 0;
> +	int ret = 0;
> +
> +	lockdep_assert_held(&fs_info->chunk_mutex);
> +	ASSERT(type >= 0 && type < BTRFS_NR_RAID_TYPES);
> +
> +	/* Not enough devices, quick exit, just update the result */
> +	if (fs_devices->rw_devices < btrfs_raid_array[type].devs_min)
> +		goto out;
> +
> +	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
> +			       GFP_NOFS);
> +	if (!devices_info) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +	/* Clear virtual chunk used space for each device */
> +	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
> +		device->virtual_allocated = 0;
> +
> +	while (!alloc_virtual_chunk(fs_info, devices_info, type, &allocated))
> +		result += allocated;
> +
> +	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
> +		device->virtual_allocated = 0;

nit: This second list_for_each_entry is redundant, because
alloc_virtual_chunk is guaranteed to be called with a zeroed
device->virtual_allocated thanks to the list_for_each_entry before the
while(). Furthermore, this call path is under chunk_mutex, so no one
can sneak in and change it.

<snip>


