On Fri, May 3, 2019 at 2:46 AM Qu Wenruo <wqu@xxxxxxxx> wrote: What a great subject. The "reflink:" part is unnecessary, since the rest of the subject already mentions it, that makes it a bit shorter. > > [BUG] > The following command can lead to unexpected data COW: > > #!/bin/bash > > dev=/dev/test/test > mnt=/mnt/btrfs > > mkfs.btrfs -f $dev -b 1G > /dev/null > mount $dev $mnt -o nospace_cache > > xfs_io -f -c "falloc 8k 24k" -c "pwrite 12k 8k" $mnt/file1 > xfs_io -c "reflink $mnt/file1 8k 0 4k" $mnt/file1 > umount $dev > > The result extent will be > > item 7 key (257 EXTENT_DATA 4096) itemoff 15760 itemsize 53 > generation 6 type 2 (prealloc) > prealloc data disk byte 13631488 nr 28672 > item 8 key (257 EXTENT_DATA 12288) itemoff 15707 itemsize 53 > generation 6 type 1 (regular) > extent data disk byte 13660160 nr 12288 <<< COW > item 9 key (257 EXTENT_DATA 24576) itemoff 15654 itemsize 53 > generation 6 type 2 (prealloc) > prealloc data disk byte 13631488 nr 28672 > > Currently we always reserve space even for NOCOW buffered write, thus I would add 'data' between 'reserve' and 'space', to be clear. > under most case it shouldn't cause anything wrong even we fall back to > COW. even we ... -> even if we fallback to COW when running delalloc / starting writeback. > > However when we're out of data space, we fall back to skip data space if > we can do NOCOW write. we fall back to skip data space ... -> we fallback to NOCOW write without reserving data space. > > If such behavior happens under that case, we could hit the following > problems: > - data space bytes_may_use underflow > This will cause kernel warning. Ok. > > - ENOSPC at delalloc time at delalloc time - that is an ambiguous term you use through the change log. You mean when running/starting delalloc, which happens when starting writeback, but that could be confused with creating delalloc, which happens during the buffered write path. So I would always replace 'dealloc time' with 'when running delalloc' (or when starting writeback). > This will lead to transaction abort and fs forced to RO. Where does that happen exactly? I don't recall starting transactions when running dealloc, and failed to see where after a quick glance to cow_file_range() and run_delalloc_nocow(). I'm assuming that 'at delalloc time' means when starting writeback. > > [CAUSE] > This is due to the fact that btrfs can only do extent level share check. > > Btrfs can only tell if an extent is shared, no matter if only part of the > extent is shared or not. > > So for above script we have: > - fallocate > - buffered write > If we don't have enough data space, we fall back to NOCOW check. > At this timming, the extent is not shared, we can skip data > reservation. But in the above example we don't fall to nocow mode when doing the buffered write, as there's plenty of data space available (1Gb - 24Kb). You need to update the example. > - reflink > Now part of the large preallocated extent is shared. > - delalloc kicks in writeback kicks in > For the NOCOW range, as the preallocated extent is shared, we need > to fall back to COW. > > [WORKAROUND] > The workaround is to ensure any buffered write in the related extents > (not the reflink source range) get flushed before reflink. not the reflink source range -> not just the reflink source range > > However it's pretty expensive to do a comprehensive check. > In the reproducer, the reflink source is just a part of a larger Again, the reproducer needs to be fixed (yes, I tested it even if it's clear by looking at it that it doesn't trigger the nocow case). > preallocated extent, we need to flush all buffered write of that extent > before reflink. > Such backward search can be complex and we may not get much benefit from > it. > > So this patch will just try to flush the whole inode before reflink. > > Signed-off-by: Qu Wenruo <wqu@xxxxxxxx> > --- > Reason for RFC: > Flushing an inode just because it's a reflink source is definitely > overkilling, but I don't have any better way to handle it. > > Any comment on this is welcomed. > --- > fs/btrfs/ioctl.c | 22 ++++++++++++++++++++++ > 1 file changed, 22 insertions(+) > > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c > index 7755b503b348..8caa0edb6fbf 100644 > --- a/fs/btrfs/ioctl.c > +++ b/fs/btrfs/ioctl.c > @@ -3930,6 +3930,28 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src, > return ret; > } > > + /* > + * Workaround to make sure NOCOW buffered write reach disk as NOCOW. > + * > + * Due to the limit of btrfs extent tree design, we can only have > + * extent level share view. Any part of an extent is shared then the Any -> If any > + * whole extent is shared and any write into that extent needs to fall is -> is considered > + * back to COW. I would add, something like: That is, btrfs' back references do not have a block level granularity, they work at the whole extent level. > + * > + * NOCOW buffered write without data space reserved could to lead to > + * either data space bytes_may_use underflow (kernel warning) or ENOSPC > + * at delalloc time (transaction abort). I would omit the warning and transaction abort parts, that can change any time. And we have that information in the changelog, so it's not lost. > + * > + * Here we take a shortcut by flush the whole inode. We could do better > + * by finding all extents in that range and flush the space referring > + * all those extents. > + * But that's too complex for such corner case. > + */ > + filemap_flush(src->i_mapping); > + if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT, > + &BTRFS_I(src)->runtime_flags)) > + filemap_flush(src->i_mapping); So a few comments here: - why just in the clone part? The dedupe side has the same problem, doesn't it? - I would move such flushing to btrfs_remap_file_range_prep - this is where we do the source and target range flush and wait. Can you turn the reproducer into an fstests case? Thanks. > + > /* > * Lock destination range to serialize with concurrent readpages() and > * source range to serialize with relocation. > -- > 2.21.0 > -- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.”
