Re: [PATCH v2 2/3] btrfs: qgroup: Try to flush qgroup space when we get -EDQUOT

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 2020/7/6 下午9:41, Josef Bacik wrote:
> On 7/3/20 2:19 AM, Qu Wenruo wrote:
>> [PROBLEM]
>> There are known problem related to how btrfs handles qgroup reserved
>> space.
>> One of the most obvious case is the the test case btrfs/153, which do
>> fallocate, then write into the preallocated range.
>>
>>    btrfs/153 1s ... - output mismatch (see
>> xfstests-dev/results//btrfs/153.out.bad)
>>        --- tests/btrfs/153.out     2019-10-22 15:18:14.068965341 +0800
>>        +++ xfstests-dev/results//btrfs/153.out.bad      2020-07-01
>> 20:24:40.730000089 +0800
>>        @@ -1,2 +1,5 @@
>>         QA output created by 153
>>        +pwrite: Disk quota exceeded
>>        +/mnt/scratch/testfile2: Disk quota exceeded
>>        +/mnt/scratch/testfile2: Disk quota exceeded
>>         Silence is golden
>>        ...
>>        (Run 'diff -u xfstests-dev/tests/btrfs/153.out
>> xfstests-dev/results//btrfs/153.out.bad'  to see the entire diff)
>>
>> [CAUSE]
>> Since commit c6887cd11149 ("Btrfs: don't do nocow check unless we have
>> to"),
>> we always reserve space no matter if it's COW or not.
>>
>> Such behavior change is mostly for performance, and reverting it is not
>> a good idea anyway.
>>
>> For preallcoated extent, we reserve qgroup data space for it already,
>> and since we also reserve data space for qgroup at buffered write time,
>> it needs twice the space for us to write into preallocated space.
>>
>> This leads to the -EDQUOT in buffered write routine.
>>
>> And we can't follow the same solution, unlike data/meta space check,
>> qgroup reserved space is shared between data/meta.
>> The EDQUOT can happen at the metadata reservation, so doing NODATACOW
>> check after qgroup reservation failure is not a solution.
>>
>> [FIX]
>> To solve the problem, we don't return -EDQUOT directly, but every time
>> we got a -EDQUOT, we try to flush qgroup space by:
>> - Flush all inodes of the root
>>    NODATACOW writes will free the qgroup reserved at run_dealloc_range().
>>    However we don't have the infrastructure to only flush NODATACOW
>>    inodes, here we flush all inodes anyway.
>>
>> - Wait ordered extents
>>    This would convert the preallocated metadata space into per-trans
>>    metadata, which can be freed in later transaction commit.
>>
>> - Commit transaction
>>    This will free all per-trans metadata space.
>>
>> Also we don't want to trigger flush too racy, so here we introduce a
>> per-root mutex to ensure if there is a running qgroup flushing, we wait
>> for it to end and don't start re-flush.
>>
>> Fixes: c6887cd11149 ("Btrfs: don't do nocow check unless we have to")
>> Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
>> ---
>>   fs/btrfs/ctree.h   |   1 +
>>   fs/btrfs/disk-io.c |   1 +
>>   fs/btrfs/qgroup.c  | 118 ++++++++++++++++++++++++++++++++++++++++-----
>>   3 files changed, 108 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 4dd478b4fe3a..891f47c7891f 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -1162,6 +1162,7 @@ struct btrfs_root {
>>       spinlock_t qgroup_meta_rsv_lock;
>>       u64 qgroup_meta_rsv_pertrans;
>>       u64 qgroup_meta_rsv_prealloc;
>> +    struct mutex qgroup_flushing_mutex;
>>         /* Number of active swapfiles */
>>       atomic_t nr_swapfiles;
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index c27022f13150..0116e0b487c9 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -1116,6 +1116,7 @@ static void __setup_root(struct btrfs_root
>> *root, struct btrfs_fs_info *fs_info,
>>       mutex_init(&root->log_mutex);
>>       mutex_init(&root->ordered_extent_mutex);
>>       mutex_init(&root->delalloc_mutex);
>> +    mutex_init(&root->qgroup_flushing_mutex);
>>       init_waitqueue_head(&root->log_writer_wait);
>>       init_waitqueue_head(&root->log_commit_wait[0]);
>>       init_waitqueue_head(&root->log_commit_wait[1]);
>> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
>> index 398de1145d2b..bf9076bc33eb 100644
>> --- a/fs/btrfs/qgroup.c
>> +++ b/fs/btrfs/qgroup.c
>> @@ -3481,13 +3481,14 @@ static int qgroup_revert(struct btrfs_inode
>> *inode,
>>           int clear_ret;
>>             entry = rb_entry(n, struct ulist_node, rb_node);
>> -        if (entry->val >= start + len)
>> -            break;
>> -        if (entry->val + entry->aux <= start)
>> -            goto next;
>>           start = entry->val;
>>           end = entry->aux;
>>           len = end - start + 1;
>> +
>> +        if (start >= start + len)
>> +            break;
>> +        if (start + len <= start)
>> +            goto next;
> 
> This is unrelated to the change at hand, put this in the 1st patch.
> 
>>           /*
>>            * Now the entry is in [start, start + len), revert the
>>            * EXTENT_QGROUP_RESERVED bit.
>> @@ -3512,17 +3513,62 @@ static int qgroup_revert(struct btrfs_inode
>> *inode,
>>   }
>>     /*
>> - * Reserve qgroup space for range [start, start + len).
>> + * Try to free some space for qgroup.
>>    *
>> - * This function will either reserve space from related qgroups or doing
>> - * nothing if the range is already reserved.
>> + * For qgroup, there are only 3 ways to free qgroup space:
>> + * - Flush nodatacow write
>> + *   Any nodatacow write will free its reserved data space at
>> + *   run_delalloc_range().
>> + *   In theory, we should only flush nodatacow inodes, but it's
>> + *   not yet possible, so we need to flush the whole root.
>>    *
>> - * Return 0 for successful reserve
>> - * Return <0 for error (including -EQUOT)
>> + * - Wait for ordered extents
>> + *   When ordered extents are finished, their reserved metadata
>> + *   is finally converted to per_trans status, which can be freed
>> + *   by later commit transaction.
>>    *
>> - * NOTE: this function may sleep for memory allocation.
>> + * - Commit transaction
>> + *   This would free the meta_per_trans space.
>> + *   In theory this shouldn't provide much space, but any more qgroup
>> space
>> + *   is needed.
>>    */
>> -int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
>> +static int try_flush_qgroup(struct btrfs_root *root)
>> +{
>> +    struct btrfs_trans_handle *trans;
>> +    int ret;
>> +
>> +    if (!is_fstree(root->root_key.objectid))
>> +        return 0;
>> +
> 
> This will just retry the reservation again without having done
> anything.  Is it even possible to end up here with a !fs_tree?

You're right, it's impossible.

The existing btrfs_qgroup_reserve_data() has extra check for root
objectid, and just return 0 for non fs trees.
So it's impossible to hit the !is_fstree() branch.

>  If it is
> then maybe we should return -EDQUOT here so we don't retry the
> reservation again.
> 
>> +    /*
>> +     * We don't want to run flush again and again, so if there is a
>> running
>> +     * one, we won't try to start a new flush, but exit directly.
>> +     */
>> +    ret = mutex_trylock(&root->qgroup_flushing_mutex);
>> +    if (!ret) {
>> +        mutex_lock(&root->qgroup_flushing_mutex);
>> +        mutex_unlock(&root->qgroup_flushing_mutex);
>> +        return 0;
>> +    }
>> +
>> +    ret = btrfs_start_delalloc_snapshot(root);
>> +    if (ret < 0)
>> +        goto unlock;
>> +    btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1);
>> +
>> +    trans = btrfs_join_transaction(root);
>> +    if (IS_ERR(trans)) {
>> +        ret = PTR_ERR(trans);
>> +        goto unlock;
>> +    }
>> +
>> +    ret = btrfs_commit_transaction(trans);
>> +unlock:
>> +    mutex_unlock(&root->qgroup_flushing_mutex);
>> +    return ret;
>> +}
> 
> I still hate this but implementing tickets for quotas probably doesn't
> make sense right now.

Yep. But I'm pretty open to integrating this into the tickets system.

Thanks,
Qu
> 
> <snip>
> 
>> +
>> +    ret = qgroup_reserve_meta(root, num_bytes, type, enforce);
>> +    if (!ret)
>> +        return ret;
>> +    if (ret < 0 && ret != -EDQUOT)
>> +        return ret;
> 
> if (ret <= 0 && ret != -EDQUOT)
>     return ret;
> 
> Thanks,
> 
> Josef
> 




[Index of Archives]     [Linux Filesystem Development]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux