Re: [PATCH] btrfs: Remove the duplicated and sometimes too cautious btrfs_can_relocate()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 2019/7/18 下午7:18, Nikolay Borisov wrote:
>
>
> On 18.07.19 г. 8:48 ч., Qu Wenruo wrote:
>> [BUG]
>> The following script can easily cause unexpected ENOSPC:
>>   umount $dev &> /dev/null
>>   umount $mnt &> /dev/null
>>
>>   mkfs.btrfs -b 1G -m single -d single $dev -f > /dev/null
>>
>>   mount $dev $mnt -o enospc_debug
>>
>>   for i in $(seq -w 0 511); do
>>   	xfs_io -f -c "pwrite 0 1m" $mnt/inline_$i > /dev/null
>>   done
>>   sync
>>
>>   btrfs balance start --full $mnt || return 1
>>   sync
>>
>>   # This will report -ENOSPC
>>   btrfs balance start --full $mnt || return 1
>>   umount $mnt
>>
>> Also, btrfs/156 can also fail due to ENOSPC.
>>
>> [CAUSE]
>> The ENOSPC is reported by btrfs_can_relocate().
>>
>> In btrfs_can_relocate(), it does the following check:
>> - If the block group is empty
>>   If empty, definitely we can relocate this block group.
>> - If we are not the only block group and we have enough space
>>   Then we can relocate this block group.
>>
>> Above two checks are completely OK, although I could argue they doesn't
>> make much sense, but the following check is vague and even sometimes
>> too cautious to cause ENOSPC:
>> - If we can allocate a new block group as large as current one.
>>   If we failed previous two checks, we must pass this to relocate this
>>   block group.
>
> btrfs_can_relocate chunk requires min_free to be allocatable.
> min_free is defined as the used space in the  block group being
> relocated, which I think is correct.

Yep, you and Filipe are completely right here.

I mischecked the @min_free.

But compared to the check in inc_block_group_ro(), it doesn't account
values like block group reserved (which is allocated for delayed ref,
but not committed to extent tree), pinned (commit tree blocks which is
not used in current trans) and super bytes (used by super block).

So the check is still not comprehensive.

> Also I find the logic which
> adjusts min_free and dev_min to also be correct. Finally the function
> checks whether the device's freespace is fragmented by trying to find a
> device chunk with the appropriate size. The question is - can we really
> have a device that has enough free space, yet is fragmented such that
> find_free_dev_extent fails which results in failing the allocation? I
> think the answer is no since we allocate in chunk granularity. What am I missing?

In fact, I have already hidden one more problem here.

At the timing of find_free_dev_extents() in btrfs_can_relocate(),
find_free_dev_extents() is not working properly as it always search
*commit* root of dev tree.

With extra dump tree, I found that commit root of dev tree at that
timing has an extra dev extent in commit tree but not in current root.
This is the root cause of the false ENOSPC.

I should explain the root cause and why at that timing calling
btrfs_can_relocate() is not reliable.

Anyway, for the removal part, it still makes sense.

>
>
> OTOH, in btrfs_inc_block_group_ro we only allocate a chunk if:
>
> a) we are changing raid profiles
> b) if inc_block_group_ro fails for our block group.
>
> And for b) I'm a bit puzzled as to what the code is supposed to mean. We have:
>
> num_bytes = cache->key.offset - cache->reserved - cache->pinned -
>                       cache->bytes_super - btrfs_block_group_used(&cache->item);
>           sinfo_used = btrfs_space_info_used(sinfo, true);
>
>           if (sinfo_used + num_bytes + min_allocable_bytes <=
>               sinfo->total_bytes) {
> //set ro
>
> }
>
> This means if the free space in the block group + the used space in the
> space info is smaller than the total space in
> the space info - make this block group RO. What's the rationale behind that?

Oh, this should be a similar check in btrfs_can_relocate(), but I was
confused by the complex check and even considered it correct after
several strange calculations.

What we really want is:

Available space in other block groups >= used/pinned/resved space in
this block_group + some buff

In code it should be
sinfo->avail - bg->available >= bg->used + buff

Then adds bg->available, we get
sinfo->avail >= bg->used + bg->available + buff

And bg->used + bg->available = bg->total and sinfo->avail = sinfo->total
- sinfo->used, we should get something like:

sinfo->total >= sinfo->used + bg->total + buff.
Compared to current one:
sinfo->total >= sinfo->used + bg->avail + buff.

In fact the current one is much easier to meet than the correct one.

I need to double check the calculation.

Thanks for pointing this out,
Qu




[Index of Archives]     [Linux Filesystem Development]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux