Re: [PATCH RFC] btrfs: reflink: Flush before reflink any extent to prevent NOCOW write falling back to CoW without data reservation

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, May 3, 2019 at 2:46 AM Qu Wenruo <wqu@xxxxxxxx> wrote:

What a great subject. The "reflink:" part is unnecessary, since the
rest of the subject already mentions it, that makes it a bit shorter.

>
> [BUG]
> The following command can lead to unexpected data COW:
>
>   #!/bin/bash
>
>   dev=/dev/test/test
>   mnt=/mnt/btrfs
>
>   mkfs.btrfs -f $dev -b 1G > /dev/null
>   mount $dev $mnt -o nospace_cache
>
>   xfs_io -f -c "falloc 8k 24k" -c "pwrite 12k 8k" $mnt/file1
>   xfs_io -c "reflink $mnt/file1 8k 0 4k" $mnt/file1
>   umount $dev
>
> The result extent will be
>
>         item 7 key (257 EXTENT_DATA 4096) itemoff 15760 itemsize 53
>                 generation 6 type 2 (prealloc)
>                 prealloc data disk byte 13631488 nr 28672
>         item 8 key (257 EXTENT_DATA 12288) itemoff 15707 itemsize 53
>                 generation 6 type 1 (regular)
>                 extent data disk byte 13660160 nr 12288 <<< COW
>         item 9 key (257 EXTENT_DATA 24576) itemoff 15654 itemsize 53
>                 generation 6 type 2 (prealloc)
>                 prealloc data disk byte 13631488 nr 28672
>
> Currently we always reserve space even for NOCOW buffered write, thus

I would add 'data' between 'reserve' and 'space', to be clear.

> under most case it shouldn't cause anything wrong even we fall back to
> COW.

even we ... -> even if we fallback to COW when running delalloc /
starting writeback.

>
> However when we're out of data space, we fall back to skip data space if
> we can do NOCOW write.

we fall back to skip data space ... -> we fallback to NOCOW write
without reserving data space.

>
> If such behavior happens under that case, we could hit the following
> problems:

> - data space bytes_may_use underflow
>   This will cause kernel warning.

Ok.

>
> - ENOSPC at delalloc time

at delalloc time - that is an ambiguous term you use through the change log.
You mean when running/starting delalloc, which happens when starting writeback,
but that could be confused with creating delalloc, which happens
during the buffered write path.

So I would always replace 'dealloc time' with 'when running delalloc'
(or when starting writeback).

>   This will lead to transaction abort and fs forced to RO.

Where does that happen exactly?
I don't recall starting transactions when running dealloc, and failed
to see where after a quick glance to cow_file_range()
and run_delalloc_nocow(). I'm assuming that 'at delalloc time' means
when starting writeback.

>
> [CAUSE]
> This is due to the fact that btrfs can only do extent level share check.
>
> Btrfs can only tell if an extent is shared, no matter if only part of the
> extent is shared or not.
>
> So for above script we have:
> - fallocate
> - buffered write
>   If we don't have enough data space, we fall back to NOCOW check.
>   At this timming, the extent is not shared, we can skip data
>   reservation.

But in the above example we don't fall to nocow mode when doing the
buffered write, as there's plenty of data space available (1Gb -
24Kb).
You need to update the example.


> - reflink
>   Now part of the large preallocated extent is shared.
> - delalloc kicks in

writeback kicks in

>   For the NOCOW range, as the preallocated extent is shared, we need
>   to fall back to COW.
>
> [WORKAROUND]
> The workaround is to ensure any buffered write in the related extents
> (not the reflink source range) get flushed before reflink.

not the reflink source range -> not just the reflink source range

>
> However it's pretty expensive to do a comprehensive check.
> In the reproducer, the reflink source is just a part of a larger

Again, the reproducer needs to be fixed (yes, I tested it even if it's
clear by looking at it that it doesn't trigger the nocow case).

> preallocated extent, we need to flush all buffered write of that extent
> before reflink.
> Such backward search can be complex and we may not get much benefit from
> it.
>
> So this patch will just try to flush the whole inode before reflink.


>
> Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
> ---
> Reason for RFC:
> Flushing an inode just because it's a reflink source is definitely
> overkilling, but I don't have any better way to handle it.
>
> Any comment on this is welcomed.
> ---
>  fs/btrfs/ioctl.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
>
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 7755b503b348..8caa0edb6fbf 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -3930,6 +3930,28 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
>                         return ret;
>         }
>
> +       /*
> +        * Workaround to make sure NOCOW buffered write reach disk as NOCOW.
> +        *
> +        * Due to the limit of btrfs extent tree design, we can only have
> +        * extent level share view. Any part of an extent is shared then the

Any -> If any

> +        * whole extent is shared and any write into that extent needs to fall

is -> is considered

> +        * back to COW.

I would add, something like:

That is, btrfs' back references do not have a block level granularity,
they work at the whole extent level.

> +        *
> +        * NOCOW buffered write without data space reserved could to lead to
> +        * either data space bytes_may_use underflow (kernel warning) or ENOSPC
> +        * at delalloc time (transaction abort).

I would omit the warning and transaction abort parts, that can change
any time. And we have that information in the changelog, so it's not
lost.

> +        *
> +        * Here we take a shortcut by flush the whole inode. We could do better
> +        * by finding all extents in that range and flush the space referring
> +        * all those extents.
> +        * But that's too complex for such corner case.
> +        */
> +       filemap_flush(src->i_mapping);
> +       if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
> +                    &BTRFS_I(src)->runtime_flags))
> +               filemap_flush(src->i_mapping);

So a few comments here:

- why just in the clone part? The dedupe side has the same problem, doesn't it?

- I would move such flushing to btrfs_remap_file_range_prep - this is
where we do the source and target range flush and wait.

Can you turn the reproducer into an fstests case?

Thanks.

> +
>         /*
>          * Lock destination range to serialize with concurrent readpages() and
>          * source range to serialize with relocation.
> --
> 2.21.0
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”




[Index of Archives]     [Linux Filesystem Development]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux