On 2019/11/28 上午2:59, David Sterba wrote:
> On Mon, Nov 25, 2019 at 06:04:50PM +0800, Qu Wenruo wrote:
>> [BUG]
>> For certain fs layout, a full balance can cause reproducible ENOSPC.
>> With enospc_debug, we got the following dmesg (BTRFS info and device
>> info ommitted to save some space):
>>
>> disk space caching is enabled
>> has skinny extents
>> balance: start -d -m -s
>> relocating block group 1104150528 flags data
>> found 14659 extents
>> found 14659 extents
>> unable to make block group 30408704 ro <<< from inc_block_group_ro()
>> sinfo_used=2386411520 bg_num_bytes=1046888448 min_allocable=1048576
>> space_info 4 has 18446744072434089984 free, is not full
>> space_info total=1073741824, used=24281088, pinned=1277952, reserved=1245184, may_use=2322333696, readonly=65536
>> global_block_rsv: size 3407872 reserved 3407872
>> trans_block_rsv: size 0 reserved 0
>> chunk_block_rsv: size 0 reserved 0
>> delayed_block_rsv: size 0 reserved 0
>> delayed_refs_rsv: size 2318401536 reserved 2318401536
>> unable to make block group 30408704 ro <<< double inc_block_group_ro()
>> failure, means
>> btrfs_inc_block_group_ro() failed
>> sinfo_used=2342912000 bg_num_bytes=1046872064 min_allocable=1048576
>> space_info 4 has 18446744072726380544 free, is not full
>> space_info total=1342177280, used=24281088, pinned=1277952, reserved=1245184, may_use=2298478592, readonly=65536
>> global_block_rsv: size 3407872 reserved 3407872
>> trans_block_rsv: size 0 reserved 0
>> chunk_block_rsv: size 393216 reserved 393216
>> delayed_block_rsv: size 0 reserved 0
>> delayed_refs_rsv: size 2294546432 reserved 2294546432
>> ...
>> 1 enospc errors during balance
>> balance: ended with status: -28
>>
>> [CAUSE]
>> When allocating block group 1104150528, since that block group has a lot
>> of extents, it has a data reloc inode with a lot of extents (14659
>> non-hole data extents).
>>
>> After relocating that block group, btrfs needs to cleanup the data reloc
>> inode.
>>
>> During that inode eviction, we have call evict_refill_and_join() to get
>> metadata space reserved, which will cause a lot of metadata
>> bytes_may_use:
>> evict_refill_and_join()
>> |- btrfs_block_rsv_refill()
>> |- btrfs_reserve_metadata_bytes()
>> |- __reserve_metadata_bytes()
>> |- if (can_overcommit() || ...) {
>> | btrfs_space_info_update_bytes_may_use();
>> | ret = 0;
>> | }
>> | if (!ret || flush == BTRFS_RESERVE_NO_FLUSH)
>> | return ret;
>> | return handle_reserve_ticket();
>>
>> That means, if we can can_overcommit(), we will increase bytes_may_use()
>> anyway.
>> And only when we failed to over-commit, handle_reserve_ticket() get
>> triggered to reclaim some space.
>>
>> On the other hand, at btrfs_inc_block_group_ro(), we will check if we
>> have enough space, and if not, allocate a chunk and retry:
>> btrfs_inc_block_group_ro()
>> |- ret = inc_block_group_ro(cache, 0);
>> | |- if (sinfo_used + num_bytes + min_allocable_bytes <=
>> | | sinfo->total_bytes)
>> | | ret = 0; # Only success if we have enough space.
>> |- ret = btrfs_alloc_chunk(); # Trigger a chunk allocation
>> |- ret = inc_block_group_ro(cache, 0);
>> |- Do the same check again.
>>
>> That means, if above over-commit threshold is larger than current space
>> + 1 more chunk, btrfs will continue over-commit, causing very large
>> bytes_may_use just like the enospc debug output:
>> space_info total=1073741824, used=24281088, ..., may_use=2322333696
>> ^^^^^^^^^^
>> The fs is 25G, DUP metadata, so the over-commit threshold can be as
>> large as 6G.
>> In our case may_use is over 2.3G, while our metadata space info is only 1G.
>> Definitely will not pass the check in btrfs_inc_block_group_ro().
>>
>> Such over-commit behavior works fine for most use cases, but when
>> btrfs_inc_block_group_ro() is involved, we will get ENOSPC.
>>
>> [FIX]
>> Change can_overcommit() threshold, to follow the
>> btrfs_inc_block_group_ro() behavior.
>>
>> Adds a new threshold check based on chunk size, so if our used bytes
>> (including bytes_may_use) exceeds current space info + 87.5% one chunk size,
>> we stop over-commit.
>>
>> The 87.5% is used as extra headroom for min_allocable_bytes (SZ_1M).
>>
>> This makes over-commit work along with btrfs_inc_block_group_ro().
>>
>> The downside is, we will have much smaller over-commit threshold.
>> This means, when fs is mostly empty, performance may drop compared to
>> the old behavior.
>>
>> Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
>> ---
>> Reason for RFC:
>> This is another extreme, compared to "[RFC PATCH] btrfs: Commit
>> transaction to workaround ENOSPC during relocation".
>>
>> This patch will reduce commit threshold for all cases, just to address
>> one case in relocation.
>>
>> While the other RFC just address one problem, and one problem only,
>> but in a whac-a-hole fashion.
>>
>> I don't know which is better, personally speaking, that whac-a-hole
>> patch may be a little better.
>>
>> So both patches are with RFC tag.
>
> This patch is superseded by the series from Josef, right?
>
Right, please discard both RFCs.
Thanks,
Qu
Attachment:
signature.asc
Description: OpenPGP digital signature
