On 6.05.19 г. 5:04 ч., Qu Wenruo wrote:
> [snip]
>>>
>>> For data writeback, it should only cause sync related failure.
>>
>> Ok, so please remove the transaction abort comments for next iteration.
>> By sync related failure, you mean running dealloc fails with ENOSPC,
>> since when it tries to reserve a data extent it fails as it can't find
>> any free extent.
>
> Well, btrfs has some hidden way to fix such problem by itself already.
>
> At buffered write time, we have the following call chain:
> btrfs_buffered_write()
> |- btrfs_check_data_free_space()
> |- btrfs_alloc_data_chunk_ondemand()
> |- need_commit = 2; /* We have 2 chance to retry, 1 to flush */
> |- do_chunk_alloc() /* we have no data space available */
> |- if (ret < 0 && ret == -ENOSPC)
> |- goto commit_trans; /* try to free some space */
> |- commit_trans:
> |- need_commit--;
> |- if (need_commit > 0) {
> |- btrfs_start_delalloc_roots();
> |- btrfs_wait_ordered_roots();
> |- }
>
> This means, as long as we hit ENOSPC for data space, we will start write
> back, thus NODATACOW data will still hit disk as NODATACOW.
I'm lost for words at expressing how subtle and despicable that code is
... Is there a way to factor that out and make it more explicit ? I
don't think we should rely on such subtleties...
>
> Such hidden behavior along with always-reserve-data-space solves the
> problem pretty well.
> We either:
> - reserve data space
> Then no matter how it ends, we're OK, although it may end as CoW.
>
> - Failed to reserve data space
> Writeback will be triggered anyway, no way to screw things around.
>
> Thus this workaround has nothing to fix, but only make certain NODATACOW
> reach disk as NODATACOW.
>
> It makes some NODATACOW behaves more correctly but won't fix any obvious
> bug.
>
> My personal take is to fix any strange behavior even it won't cause any
> problem, but the full inode writeback can be performance heavy.
>
> So my question is, do we really need this anyway?
>
> Thanks,
> Qu
>
>>
>>>
>>>> I don't recall starting transactions when running dealloc, and failed
>>>> to see where after a quick glance to cow_file_range()
>>>> and run_delalloc_nocow(). I'm assuming that 'at delalloc time' means
>>>> when starting writeback.
>>>>
>>>>>
>>>>> [CAUSE]
>>>>> This is due to the fact that btrfs can only do extent level share check.
>>>>>
>>>>> Btrfs can only tell if an extent is shared, no matter if only part of the
>>>>> extent is shared or not.
>>>>>
>>>>> So for above script we have:
>>>>> - fallocate
>>>>> - buffered write
>>>>> If we don't have enough data space, we fall back to NOCOW check.
>>>>> At this timming, the extent is not shared, we can skip data
>>>>> reservation.
>>>>
>>>> But in the above example we don't fall to nocow mode when doing the
>>>> buffered write, as there's plenty of data space available (1Gb -
>>>> 24Kb).
>>>> You need to update the example.
>>> I have to admit that the core part is mostly based on the worst case
>>> *assumption*.
>>>
>>> I'll try to make the case convincing by making it fail directly.
>>
>> Great, thanks.
>>
>>>
>>>>
>>>>
>>>>> - reflink
>>>>> Now part of the large preallocated extent is shared.
>>>>> - delalloc kicks in
>>>>
>>>> writeback kicks in
>>>>
>>>>> For the NOCOW range, as the preallocated extent is shared, we need
>>>>> to fall back to COW.
>>>>>
>>>>> [WORKAROUND]
>>>>> The workaround is to ensure any buffered write in the related extents
>>>>> (not the reflink source range) get flushed before reflink.
>>>>
>>>> not the reflink source range -> not just the reflink source range
>>>>
>>>>>
>>>>> However it's pretty expensive to do a comprehensive check.
>>>>> In the reproducer, the reflink source is just a part of a larger
>>>>
>>>> Again, the reproducer needs to be fixed (yes, I tested it even if it's
>>>> clear by looking at it that it doesn't trigger the nocow case).
>>>>
>>>>> preallocated extent, we need to flush all buffered write of that extent
>>>>> before reflink.
>>>>> Such backward search can be complex and we may not get much benefit from
>>>>> it.
>>>>>
>>>>> So this patch will just try to flush the whole inode before reflink.
>>>>
>>>>
>>>>>
>>>>> Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
>>>>> ---
>>>>> Reason for RFC:
>>>>> Flushing an inode just because it's a reflink source is definitely
>>>>> overkilling, but I don't have any better way to handle it.
>>>>>
>>>>> Any comment on this is welcomed.
>>>>> ---
>>>>> fs/btrfs/ioctl.c | 22 ++++++++++++++++++++++
>>>>> 1 file changed, 22 insertions(+)
>>>>>
>>>>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>>>>> index 7755b503b348..8caa0edb6fbf 100644
>>>>> --- a/fs/btrfs/ioctl.c
>>>>> +++ b/fs/btrfs/ioctl.c
>>>>> @@ -3930,6 +3930,28 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
>>>>> return ret;
>>>>> }
>>>>>
>>>>> + /*
>>>>> + * Workaround to make sure NOCOW buffered write reach disk as NOCOW.
>>>>> + *
>>>>> + * Due to the limit of btrfs extent tree design, we can only have
>>>>> + * extent level share view. Any part of an extent is shared then the
>>>>
>>>> Any -> If any
>>>>
>>>>> + * whole extent is shared and any write into that extent needs to fall
>>>>
>>>> is -> is considered
>>>>
>>>>> + * back to COW.
>>>>
>>>> I would add, something like:
>>>>
>>>> That is, btrfs' back references do not have a block level granularity,
>>>> they work at the whole extent level.
>>>>
>>>>> + *
>>>>> + * NOCOW buffered write without data space reserved could to lead to
>>>>> + * either data space bytes_may_use underflow (kernel warning) or ENOSPC
>>>>> + * at delalloc time (transaction abort).
>>>>
>>>> I would omit the warning and transaction abort parts, that can change
>>>> any time. And we have that information in the changelog, so it's not
>>>> lost.
>>>>
>>>>> + *
>>>>> + * Here we take a shortcut by flush the whole inode. We could do better
>>>>> + * by finding all extents in that range and flush the space referring
>>>>> + * all those extents.
>>>>> + * But that's too complex for such corner case.
>>>>> + */
>>>>> + filemap_flush(src->i_mapping);
>>>>> + if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
>>>>> + &BTRFS_I(src)->runtime_flags))
>>>>> + filemap_flush(src->i_mapping);
>>>>
>>>> So a few comments here:
>>>>
>>>> - why just in the clone part? The dedupe side has the same problem, doesn't it?
>>>
>>> Right.
>>>
>>>>
>>>> - I would move such flushing to btrfs_remap_file_range_prep - this is
>>>> where we do the source and target range flush and wait.
>>>>
>>>> Can you turn the reproducer into an fstests case?
>>>
>>> Sure.
>>>
>>> Thanks for the info and all the comment,
>>> Qu
>>>
>>>>
>>>> Thanks.
>>>>
>>>>> +
>>>>> /*
>>>>> * Lock destination range to serialize with concurrent readpages() and
>>>>> * source range to serialize with relocation.
>>>>> --
>>>>> 2.21.0
>>>>>
>>>>
>>>>
>>>
>>
>>
>