On 2020/6/8 11:17 PM, Josef Bacik wrote:
> On 6/7/20 3:25 AM, Qu Wenruo wrote:
>> [BUG]
>> The following simple workload from fsstress can lead to qgroup reserved
>> data space leakage:
>> 0/0: creat f0 x:0 0 0
>> 0/0: creat add id=0,parent=-1
>> 0/1: write f0[259 1 0 0 0 0] [600030,27288] 0
>> 0/4: dwrite - xfsctl(XFS_IOC_DIOINFO) f0[259 1 0 0 64 627318]
>> return 25, fallback to stat()
>> 0/4: dwrite f0[259 1 0 0 64 627318] [610304,106496] 0
>>
>> This would cause btrfs qgroup to leak 20480 bytes of reserved data
>> space.
>> If the btrfs qgroup limit is enabled, such leakage can lead to
>> unexpected early EDQUOT and unusable space.
>>
>> [CAUSE]
>> When doing direct IO, the kernel first writes back the existing
>> buffered page cache for the range, then invalidates it:
>> iomap_dio_rw()
>> |- filemap_write_and_wait_range();
>> |- invalidate_inode_pages2_range();
>>
>> However for btrfs, the bi_end_io hook doesn't finish all its heavy work
>> right after the bio ends.
>> In fact, it defers the work even further:
>> submit_extent_page(end_io_func=end_bio_extent_writepage);
>> end_bio_extent_writepage()
>> |- btrfs_writepage_endio_finish_ordered()
>>    |- btrfs_init_work(finish_ordered_fn);
>>
>> <<< Work queue execution >>>
>> finish_ordered_fn()
>> |- btrfs_finish_ordered_io();
>>    |- Clear qgroup bits
>>
>> This means that when filemap_write_and_wait_range() returns,
>> btrfs_finish_ordered_io() is not guaranteed to have been executed, thus
>> the qgroup bits for the related range are not yet cleared.
>>
>> Now on to how the leakage happens; the following focuses only on the
>> range where the buffered and direct IO overlap.
>>
>> 1. After buffered write
>>    The inode had the following range with the QGROUP_RESERVED bit set:
>>    596K         616K
>>    |///////////////|
>>    Qgroup reserved data space: 20K
>>
>> 2. Writeback for range [596K, 616K)
>>    Writeback finished, but btrfs_finish_ordered_io() has not been
>>    called yet.
>>    So we still have:
>>    596K         616K
>>    |///////////////|
>>    Qgroup reserved data space: 20K
>>
>> 3. Pages for range [596K, 616K) get released
>>    This clears all qgroup bits, but doesn't update the reserved data
>>    space.
>>    So we have:
>>    596K         616K
>>    |               |
>>    Qgroup reserved data space: 20K
>>    That number no longer matches the qgroup bit range.
>>
>> 4. Dio prepares space for range [596K, 700K)
>>    Qgroup reserves data space for the whole range (the bits for
>>    [596K, 616K) were cleared in step 3), so we get:
>>    596K         616K                    700K
>>    |///////////////|///////////////////////|
>>    Qgroup reserved data space: 20K + 104K = 124K
>>
>> 5. btrfs_finish_ordered_io() gets executed for range [596K, 616K)
>>    Qgroup frees the reserved space for that range, so we get:
>>    596K         616K                    700K
>>    |               |///////////////////////|
>>    20K of reserved space is freed for that range.
>>    Qgroup reserved data space: 124K - 20K = 104K
>>
>> 6. btrfs_finish_ordered_io() gets executed for range [596K, 700K)
>>    However the qgroup bits for range [596K, 616K) were already cleared
>>    in the previous step, so we only free 84K of qgroup reserved space.
>>    596K         616K                    700K
>>    |               |                       |
>>    We should free the whole range, but only 84K gets freed.
>>    Qgroup reserved data space: 104K - 84K = 20K
>>
>> Now there is no way to release that 20K other than disabling qgroups or
>> unmounting the fs.
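
(Side note to make the arithmetic of steps 4-6 easier to follow: a
throwaway user-space snippet that just replays the numbers above. It is
not btrfs code and the variable name is made up.)

  #include <stdio.h>

  int main(void)
  {
          long reserved = 20 << 10;  /* step 1: buffered write reserves 20K */

          reserved += 104 << 10;     /* step 4: dio reserves the whole
                                      * [596K, 700K) again, since its qgroup
                                      * bits were dropped at page release */
          reserved -= 20 << 10;      /* step 5: ordered IO for [596K, 616K)
                                      * frees 20K */
          reserved -= 84 << 10;      /* step 6: ordered IO for [596K, 700K)
                                      * only finds 84K of bits left */

          printf("leaked: %ld bytes\n", reserved);  /* prints 20480 */
          return 0;
  }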
>>
>> [FIX]
>> This patch fixes the problem by calling btrfs_qgroup_free_data() when a
>> page is released.
>>
>> That way, even if a dirty page is released before its ordered extent
>> finishes, its qgroup reserved data space gets freed along with it.
>>
>> Fixes: f695fdcef83a ("btrfs: qgroup: Introduce functions to
>> release/free qgroup reserve data space")
>> Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
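
(For reference, the shape of the [FIX] above is roughly the following.
This is a sketch only; the helper name is made up and the
btrfs_qgroup_free_data() call is from memory, not the actual patch.)

  /*
   * Sketch only: when a page is released while its range may still have
   * EXTENT_QGROUP_RESERVED set, drop the matching reservation.
   */
  static void btrfs_qgroup_free_released_page(struct inode *inode,
                                              struct page *page)
  {
          u64 start = page_offset(page);

          /*
           * With a NULL extent_changeset, btrfs_qgroup_free_data() frees
           * however many bytes are still flagged EXTENT_QGROUP_RESERVED
           * in [start, start + PAGE_SIZE).
           */
          btrfs_qgroup_free_data(inode, NULL, start, PAGE_SIZE);
  }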
>
> This seems backwards to me, and not in keeping with the actual lifetime
> of the changes. At the point that the ordered extent is created it is
> now in charge of the qgroup reservation, so it should be the ultimate
> arbiter of what is done with that qgroup reservation. So fix
> try_release_extent_state to not remove EXTENT_QGROUP_RESERVED, because
> it's going to get dropped elsewhere. Thanks,
Indeed, doing the qgroup rsv work in the ordered extent looks more
reasonable.
Although that change would alter a lot of the timing and won't go as
smoothly on the first run, it still looks like the more proper fix.
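
If I read the try_release_extent_state() part correctly, it would be
roughly the following (a sketch from memory; the exact
__clear_extent_bit() arguments may differ in the final patch):

  /*
   * In try_release_extent_state(): keep EXTENT_QGROUP_RESERVED when
   * clearing bits on page release, so the reservation is only dropped by
   * whoever owns it (the ordered extent).
   */
  ret = __clear_extent_bit(tree, start, end,
                           ~(EXTENT_LOCKED | EXTENT_NODATASUM |
                             EXTENT_QGROUP_RESERVED),
                           0, 0, NULL, mask, NULL);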
Thanks for the advice,
Qu
>
> Josef
