On Wed, May 8, 2019 at 1:43 PM Qu Wenruo <wqu@xxxxxxxx> wrote: > > [BUG] > The following script can cause unexpected fsync failure: > > #!/bin/bash > > dev=/dev/test/test > mnt=/mnt/btrfs > > mkfs.btrfs -f $dev -b 512M > /dev/null > mount $dev $mnt -o nospace_cache > > # Prealloc one extent > xfs_io -f -c "falloc 8k 64m" $mnt/file1 > # Fill the remaining data space > xfs_io -f -c "pwrite 0 -b 4k 512M" $mnt/padding > sync > > # Write into the prealloc extent > xfs_io -c "pwrite 1m 16m" $mnt/file1 > > # Reflink then fsync, fsync would fail due to ENOSPC > xfs_io -c "reflink $mnt/file1 8k 0 4k" -c "fsync" $mnt/file1 > umount $dev > > The fsync fails with ENOSPC, and the last page of the buffered write is > lost. > > [CAUSE] > This is caused by: > - Btrfs' back reference only has extent level granularity > So write into shared extent must be CoWed even only part of the extent > is shared. > > So for above script we have: > - fallocate > Create a preallocated extent where we can do NOCOW write. > > - fill all the remaining data and unallocated space > > - buffered write into preallocated space > As we have not enough space available for data and the extent is not > shared (yet) we fall into NOCOW mode. > > - reflink > Now part of the large preallocated extent is shared, later write > into that extent must be CoWed. > > - fsync triggers writeback > But now the extent is shared and therefore we must fallback into COW > mode, which fails with ENOSPC since there's not enough space to > allocate data extents. > > [WORKAROUND] > The workaround is to ensure any buffered write in the related extents > (not just the reflink source range) get flushed before reflink/dedupe, > so that NOCOW writes succeed that happened before reflinking succeed. > > The workaround is expensive, we could do it better by only flushing > NOCOW range, but that needs extra accounting for NOCOW range. > For now, fix the possible data loss first. > > Signed-off-by: Qu Wenruo <wqu@xxxxxxxx> Reviewed-by: Filipe Manana <fdmanana@xxxxxxxx> Thanks. > --- > changelog: > RFC->v1: > - Use better words for commit message and comment. > - Move the whole inode flushing to btrfs_remap_file_range_prep(). > This also covers dedupe. > - Update the reproducer to fail explicitly. > - Remove false statement on transaction abort. > > v1->v2: > - Extra comment and commit message refine. > - Don't wait ordered extent, only flush (starts delalloc) > Single filemap_flush() should be enough. The async extent part will > never be NOCOWed, thus no need to worry. > --- > fs/btrfs/ioctl.c | 21 +++++++++++++++++++++ > 1 file changed, 21 insertions(+) > > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c > index 6dafa857bbb9..0e35bef2ec59 100644 > --- a/fs/btrfs/ioctl.c > +++ b/fs/btrfs/ioctl.c > @@ -4001,6 +4001,27 @@ static int btrfs_remap_file_range_prep(struct file *file_in, loff_t pos_in, > if (!same_inode) > inode_dio_wait(inode_out); > > + /* > + * Workaround to make sure NOCOW buffered write reach disk as NOCOW. > + * > + * Btrfs' back references do not have a block level granularity, they > + * work at the whole extent level. > + * NOCOW buffered write without data space reserved may not be able > + * to fall back to CoW due to lack of data space, thus could cause > + * data loss. > + * > + * Here we take a shortcut by flushing the whole inode, so that all > + * nocow write should reach disk as nocow before we increase the > + * reference of the extent. We could do better by only flushing NOCOW > + * data, but that needs extra accounting. > + * > + * Also we don't need to check ASYNC_EXTENT, as async extent will be > + * CoWed anyway, not affecting nocow part. > + */ > + ret = filemap_flush(inode_in->i_mapping); > + if (ret < 0) > + return ret; > + > ret = btrfs_wait_ordered_range(inode_in, ALIGN_DOWN(pos_in, bs), > wb_len); > if (ret < 0) > -- > 2.21.0 > -- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.”
