Re: [PATCH RFC] btrfs: reflink: Flush before reflink any extent to prevent NOCOW write falling back to CoW without data reservation

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 6.05.19 г. 5:04 ч., Qu Wenruo wrote:
> [snip]
>>>
>>> For data writeback, it should only cause sync related failure.
>>
>> Ok, so please remove the transaction abort comments for next iteration.
>> By sync related failure, you mean running dealloc fails with ENOSPC,
>> since when it tries to reserve a data extent it fails as it can't find
>> any free extent.
> 
> Well, btrfs has some hidden way to fix such problem by itself already.
> 
> At buffered write time, we have the following call chain:
> btrfs_buffered_write()
> |- btrfs_check_data_free_space()
>    |- btrfs_alloc_data_chunk_ondemand()
>       |- need_commit = 2; /* We have 2 chance to retry, 1 to flush */
>       |- do_chunk_alloc() /* we have no data space available */
>       |- if (ret < 0 && ret == -ENOSPC)
>       |-     goto commit_trans;  /* try to free some space */
>       |- commit_trans:
>       |-     need_commit--;
>       |-     if (need_commit > 0) {
>       |-         btrfs_start_delalloc_roots();
>       |-         btrfs_wait_ordered_roots();
>       |-     }
> 
> This means, as long as we hit ENOSPC for data space, we will start write
> back, thus NODATACOW data will still hit disk as NODATACOW.

I'm lost for words at expressing how subtle and despicable that code is
... Is there a way to factor that out and make it more explicit ? I
don't think we should rely on such subtleties...

> 
> Such hidden behavior along with always-reserve-data-space solves the
> problem pretty well.
> We either:
> - reserve data space
>   Then no matter how it ends, we're OK, although it may end as CoW.
> 
> - Failed to reserve data space
>   Writeback will be triggered anyway, no way to screw things around.
> 
> Thus this workaround has nothing to fix, but only make certain NODATACOW
> reach disk as NODATACOW.
> 
> It makes some NODATACOW behaves more correctly but won't fix any obvious
> bug.
> 
> My personal take is to fix any strange behavior even it won't cause any
> problem, but the full inode writeback can be performance heavy.
> 
> So my question is, do we really need this anyway?
> 
> Thanks,
> Qu
> 
>>
>>>
>>>> I don't recall starting transactions when running dealloc, and failed
>>>> to see where after a quick glance to cow_file_range()
>>>> and run_delalloc_nocow(). I'm assuming that 'at delalloc time' means
>>>> when starting writeback.
>>>>
>>>>>
>>>>> [CAUSE]
>>>>> This is due to the fact that btrfs can only do extent level share check.
>>>>>
>>>>> Btrfs can only tell if an extent is shared, no matter if only part of the
>>>>> extent is shared or not.
>>>>>
>>>>> So for above script we have:
>>>>> - fallocate
>>>>> - buffered write
>>>>>   If we don't have enough data space, we fall back to NOCOW check.
>>>>>   At this timming, the extent is not shared, we can skip data
>>>>>   reservation.
>>>>
>>>> But in the above example we don't fall to nocow mode when doing the
>>>> buffered write, as there's plenty of data space available (1Gb -
>>>> 24Kb).
>>>> You need to update the example.
>>> I have to admit that the core part is mostly based on the worst case
>>> *assumption*.
>>>
>>> I'll try to make the case convincing by making it fail directly.
>>
>> Great, thanks.
>>
>>>
>>>>
>>>>
>>>>> - reflink
>>>>>   Now part of the large preallocated extent is shared.
>>>>> - delalloc kicks in
>>>>
>>>> writeback kicks in
>>>>
>>>>>   For the NOCOW range, as the preallocated extent is shared, we need
>>>>>   to fall back to COW.
>>>>>
>>>>> [WORKAROUND]
>>>>> The workaround is to ensure any buffered write in the related extents
>>>>> (not the reflink source range) get flushed before reflink.
>>>>
>>>> not the reflink source range -> not just the reflink source range
>>>>
>>>>>
>>>>> However it's pretty expensive to do a comprehensive check.
>>>>> In the reproducer, the reflink source is just a part of a larger
>>>>
>>>> Again, the reproducer needs to be fixed (yes, I tested it even if it's
>>>> clear by looking at it that it doesn't trigger the nocow case).
>>>>
>>>>> preallocated extent, we need to flush all buffered write of that extent
>>>>> before reflink.
>>>>> Such backward search can be complex and we may not get much benefit from
>>>>> it.
>>>>>
>>>>> So this patch will just try to flush the whole inode before reflink.
>>>>
>>>>
>>>>>
>>>>> Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
>>>>> ---
>>>>> Reason for RFC:
>>>>> Flushing an inode just because it's a reflink source is definitely
>>>>> overkilling, but I don't have any better way to handle it.
>>>>>
>>>>> Any comment on this is welcomed.
>>>>> ---
>>>>>  fs/btrfs/ioctl.c | 22 ++++++++++++++++++++++
>>>>>  1 file changed, 22 insertions(+)
>>>>>
>>>>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>>>>> index 7755b503b348..8caa0edb6fbf 100644
>>>>> --- a/fs/btrfs/ioctl.c
>>>>> +++ b/fs/btrfs/ioctl.c
>>>>> @@ -3930,6 +3930,28 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
>>>>>                         return ret;
>>>>>         }
>>>>>
>>>>> +       /*
>>>>> +        * Workaround to make sure NOCOW buffered write reach disk as NOCOW.
>>>>> +        *
>>>>> +        * Due to the limit of btrfs extent tree design, we can only have
>>>>> +        * extent level share view. Any part of an extent is shared then the
>>>>
>>>> Any -> If any
>>>>
>>>>> +        * whole extent is shared and any write into that extent needs to fall
>>>>
>>>> is -> is considered
>>>>
>>>>> +        * back to COW.
>>>>
>>>> I would add, something like:
>>>>
>>>> That is, btrfs' back references do not have a block level granularity,
>>>> they work at the whole extent level.
>>>>
>>>>> +        *
>>>>> +        * NOCOW buffered write without data space reserved could to lead to
>>>>> +        * either data space bytes_may_use underflow (kernel warning) or ENOSPC
>>>>> +        * at delalloc time (transaction abort).
>>>>
>>>> I would omit the warning and transaction abort parts, that can change
>>>> any time. And we have that information in the changelog, so it's not
>>>> lost.
>>>>
>>>>> +        *
>>>>> +        * Here we take a shortcut by flush the whole inode. We could do better
>>>>> +        * by finding all extents in that range and flush the space referring
>>>>> +        * all those extents.
>>>>> +        * But that's too complex for such corner case.
>>>>> +        */
>>>>> +       filemap_flush(src->i_mapping);
>>>>> +       if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
>>>>> +                    &BTRFS_I(src)->runtime_flags))
>>>>> +               filemap_flush(src->i_mapping);
>>>>
>>>> So a few comments here:
>>>>
>>>> - why just in the clone part? The dedupe side has the same problem, doesn't it?
>>>
>>> Right.
>>>
>>>>
>>>> - I would move such flushing to btrfs_remap_file_range_prep - this is
>>>> where we do the source and target range flush and wait.
>>>>
>>>> Can you turn the reproducer into an fstests case?
>>>
>>> Sure.
>>>
>>> Thanks for the info and all the comment,
>>> Qu
>>>
>>>>
>>>> Thanks.
>>>>
>>>>> +
>>>>>         /*
>>>>>          * Lock destination range to serialize with concurrent readpages() and
>>>>>          * source range to serialize with relocation.
>>>>> --
>>>>> 2.21.0
>>>>>
>>>>
>>>>
>>>
>>
>>
> 



[Index of Archives]     [Linux Filesystem Development]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux