On 1/7/20 6:08 AM, Qu Wenruo wrote:
On 2020/1/7 12:50 AM, Josef Bacik wrote:
btrfs/061 has been failing consistently for me recently with a
transaction abort. We run out of space in the system chunk array, which
means we've allocated far more system chunks than we need.
Isn't that caused by scrubbing creating unnecessary system chunks?
IIRC I had a patch to address that problem by simply not allocating
system chunks for scrub.
("btrfs: scrub: Don't check free space before marking a block group RO")
This addresses the symptoms, not the root cause of the problem. Your fix is
valid, because we probably shouldn't be doing that, but we also shouldn't be
forcing restriping of block groups arbitrarily.
That doesn't address the whole problem, but it should at least
reduce the possibility.
Furthermore, with the newer over-commit behavior for inc_block_group_ro
("btrfs: use btrfs_can_overcommit in inc_block_group_ro"), we won't
really allocate new system chunks anymore if we can over-commit.
With those two patches, I guess we should have solved the problem.
Or did I miss something?
You are missing that we're getting forced to allocate a system chunk from this:

	alloc_flags = update_block_group_flags(fs_info, cache->flags);
	if (alloc_flags != cache->flags) {
		ret = btrfs_chunk_alloc(trans, alloc_flags, CHUNK_ALLOC_FORCE);

which you move down in your patch, but which will still get tripped by rebalance.
So you sort of paper over the real problem; we just don't get bitten by it as
hard with 061 because balance takes longer than scrub does. If we let it run
longer per fs type we'd still hit the same problem.
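
To make the failure mode concrete, here is a toy model in C. It is not the
kernel code: everything prefixed toy_ or TOY_ is hypothetical, and the array
capacity (2048 bytes) and per-entry size (97 bytes, roughly one single-stripe
chunk item plus its key) are assumptions used only for illustration. It sketches
why, when the restripe target differs from the block group's current flags,
every trip through the RO path forces another system chunk, so the array fills
after a fixed number of trips regardless of whether scrub or balance drives it.

```c
#include <stddef.h>

/* Assumed sizes for illustration; not taken from the thread. */
#define TOY_ARRAY_SIZE 2048u /* bytes available for system chunk entries */
#define TOY_ENTRY_SIZE 97u   /* bytes consumed by one system chunk entry */

/* Toy flag bits; they only need to be distinct from each other. */
#define TOY_SYSTEM 0x2ULL
#define TOY_RAID1  0x10ULL

struct toy_fs {
	unsigned int array_used;            /* bytes of the array in use */
	unsigned long long restripe_target; /* 0 means no restripe pending */
};

/* A forced allocation always adds an entry, or fails once the array is full. */
static int toy_force_system_chunk_alloc(struct toy_fs *fs)
{
	if (fs->array_used + TOY_ENTRY_SIZE > TOY_ARRAY_SIZE)
		return -1; /* stands in for the transaction abort */
	fs->array_used += TOY_ENTRY_SIZE;
	return 0;
}

/* Stand-in for the flags update: a pending restripe target overrides flags. */
static unsigned long long toy_update_flags(const struct toy_fs *fs,
					   unsigned long long flags)
{
	return fs->restripe_target ? fs->restripe_target : flags;
}

/* Stand-in for marking a block group RO: differing flags force an alloc. */
static int toy_inc_block_group_ro(struct toy_fs *fs,
				  unsigned long long cache_flags)
{
	unsigned long long alloc_flags = toy_update_flags(fs, cache_flags);

	if (alloc_flags != cache_flags)
		return toy_force_system_chunk_alloc(fs);
	return 0;
}

/* Count how many RO trips succeed before the array fills (capped). */
static unsigned int toy_ro_cycles_until_full(unsigned long long cache_flags,
					     unsigned long long target,
					     unsigned int max_cycles)
{
	struct toy_fs fs = { .array_used = 0, .restripe_target = target };
	unsigned int n = 0;

	while (n < max_cycles && toy_inc_block_group_ro(&fs, cache_flags) == 0)
		n++;
	return n;
}
```

Under those assumed sizes, toy_ro_cycles_until_full(TOY_SYSTEM,
TOY_SYSTEM | TOY_RAID1, 1000) stops after 21 trips, while passing 0 as the
target (no restripe pending) runs all 1000 trips without touching the array.
Not allocating for scrub only reduces the number of trips; it does not remove
the forced allocation itself.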
In short, your patches do make it better, and are definitely correct because we
probably shouldn't be allocating new chunks for scrub, but they don't address
the real cause of the problem. All the patches are needed. Thanks,
Josef