On 2020/1/30 下午6:02, Filipe Manana wrote: > On Thu, Jan 30, 2020 at 5:30 AM Qu Wenruo <wqu@xxxxxxxx> wrote: >> >> [BUG] >> When the data space is exhausted, even the inode has NOCOW attribute, >> btrfs will still refuse to truncate unaligned range due to ENOSPC. >> >> The following script can reproduce it pretty easily: >> #!/bin/bash >> >> dev=/dev/test/test >> mnt=/mnt/btrfs >> >> umount $dev &> /dev/null >> umount $mnt&> /dev/null >> >> mkfs.btrfs -f $dev -b 1G >> mount -o nospace_cache $dev $mnt >> touch $mnt/foobar >> chattr +C $mnt/foobar >> >> xfs_io -f -c "pwrite -b 4k 0 4k" $mnt/foobar > /dev/null >> xfs_io -f -c "pwrite -b 4k 0 1G" $mnt/padding &> /dev/null >> sync >> >> xfs_io -c "fpunch 0 2k" $mnt/foobar >> umount $mnt >> >> Current btrfs will fail at the fpunch part. >> >> [CAUSE] >> Because btrfs_truncate_block() always reserve space without checking the >> NOCOW attribute. >> >> Since the writeback path follows NOCOW bit, we only need to bother the >> space reservation code in btrfs_truncate_block(). >> >> [FIX] >> Make btrfs_truncate_block() to follow btrfs_buffered_write() to try to >> reserve data space first, and falls back to NOCOW check only when we >> don't have enough space. >> >> Such always-try-reserve is an optimization introduced in >> btrfs_buffered_write(), to avoid expensive btrfs_check_can_nocow() call. >> >> Since now check_can_nocow() is needed outside of inode.c, also export it >> and rename it to btrfs_check_can_nocow(). >> >> Reported-by: Martin Doucha <martin.doucha@xxxxxxxx> >> Signed-off-by: Qu Wenruo <wqu@xxxxxxxx> >> --- >> Test case will be submitted to fstests by the reporter. > > Well, this is a sudden change of mind, isn't it? :) > > We had btrfs/172, which you removed very recently, that precisely tested this: > > https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/commit/?id=538d8a4bcc782258f8f95fae815d5e859dee9126 I didn't notice the nodatacow mount option. Super duper big facepalm. All my bad, especially feel sorry for Anand. With nodatacow mount option there, that test case in fact makes a lot of sense. Sorry again for that. Anand, mind to resubmit it to generic group? Thanks, Qu > > Even though there are several reasons why this can still fail (at > writeback time), like regular buffered writes through the family of > write() syscalls can, I think it's perfectly fine to have this > behaviour. > > Reviewed-by: Filipe Manana <fdmanana@xxxxxxxx> > > So I think we can just resurrect btrfs/172 now... > >> --- >> fs/btrfs/ctree.h | 2 ++ >> fs/btrfs/file.c | 10 +++++----- >> fs/btrfs/inode.c | 41 ++++++++++++++++++++++++++++++++++------- >> 3 files changed, 41 insertions(+), 12 deletions(-) >> >> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h >> index 54efb21c2727..b5639f3461e4 100644 >> --- a/fs/btrfs/ctree.h >> +++ b/fs/btrfs/ctree.h >> @@ -2954,6 +2954,8 @@ int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end); >> loff_t btrfs_remap_file_range(struct file *file_in, loff_t pos_in, >> struct file *file_out, loff_t pos_out, >> loff_t len, unsigned int remap_flags); >> +int btrfs_check_can_nocow(struct btrfs_inode *inode, loff_t pos, >> + size_t *write_bytes); >> >> /* tree-defrag.c */ >> int btrfs_defrag_leaves(struct btrfs_trans_handle *trans, >> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c >> index 8d47c76b7bd1..8dc084600f4e 100644 >> --- a/fs/btrfs/file.c >> +++ b/fs/btrfs/file.c >> @@ -1544,8 +1544,8 @@ lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct page **pages, >> return ret; >> } >> >> -static noinline int check_can_nocow(struct btrfs_inode *inode, loff_t pos, >> - size_t *write_bytes) >> +int btrfs_check_can_nocow(struct btrfs_inode *inode, loff_t pos, >> + size_t *write_bytes) >> { >> struct btrfs_fs_info *fs_info = inode->root->fs_info; >> struct btrfs_root *root = inode->root; >> @@ -1645,8 +1645,8 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb, >> if (ret < 0) { >> if ((BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW | >> BTRFS_INODE_PREALLOC)) && >> - check_can_nocow(BTRFS_I(inode), pos, >> - &write_bytes) > 0) { >> + btrfs_check_can_nocow(BTRFS_I(inode), pos, >> + &write_bytes) > 0) { >> /* >> * For nodata cow case, no need to reserve >> * data space. >> @@ -1923,7 +1923,7 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb, >> */ >> if (!(BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW | >> BTRFS_INODE_PREALLOC)) || >> - check_can_nocow(BTRFS_I(inode), pos, &count) <= 0) { >> + btrfs_check_can_nocow(BTRFS_I(inode), pos, &count) <= 0) { >> inode_unlock(inode); >> return -EAGAIN; >> } >> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >> index 5509c41a4f43..b5ae4bbf1ad4 100644 >> --- a/fs/btrfs/inode.c >> +++ b/fs/btrfs/inode.c >> @@ -4974,11 +4974,13 @@ int btrfs_truncate_block(struct inode *inode, loff_t from, loff_t len, >> struct extent_state *cached_state = NULL; >> struct extent_changeset *data_reserved = NULL; >> char *kaddr; >> + bool only_release_metadata = false; >> u32 blocksize = fs_info->sectorsize; >> pgoff_t index = from >> PAGE_SHIFT; >> unsigned offset = from & (blocksize - 1); >> struct page *page; >> gfp_t mask = btrfs_alloc_write_mask(mapping); >> + size_t write_bytes = blocksize; >> int ret = 0; >> u64 block_start; >> u64 block_end; >> @@ -4990,11 +4992,26 @@ int btrfs_truncate_block(struct inode *inode, loff_t from, loff_t len, >> block_start = round_down(from, blocksize); >> block_end = block_start + blocksize - 1; >> >> - ret = btrfs_delalloc_reserve_space(inode, &data_reserved, >> - block_start, blocksize); >> - if (ret) >> + ret = btrfs_check_data_free_space(inode, &data_reserved, block_start, >> + blocksize); >> + if (ret < 0) { >> + if ((BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW | >> + BTRFS_INODE_PREALLOC)) && >> + btrfs_check_can_nocow(BTRFS_I(inode), block_start, >> + &write_bytes) > 0) { >> + /* For nocow case, no need to reserve data space. */ >> + only_release_metadata = true; >> + } else { >> + goto out; >> + } >> + } >> + ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), blocksize); >> + if (ret < 0) { >> + if (!only_release_metadata) >> + btrfs_free_reserved_data_space(inode, data_reserved, >> + block_start, blocksize); >> goto out; >> - >> + } >> again: >> page = find_or_create_page(mapping, index, mask); >> if (!page) { >> @@ -5063,10 +5080,20 @@ int btrfs_truncate_block(struct inode *inode, loff_t from, loff_t len, >> set_page_dirty(page); >> unlock_extent_cached(io_tree, block_start, block_end, &cached_state); >> >> + if (only_release_metadata) >> + set_extent_bit(&BTRFS_I(inode)->io_tree, block_start, >> + block_end, EXTENT_NORESERVE, NULL, NULL, >> + GFP_NOFS); >> + >> out_unlock: >> - if (ret) >> - btrfs_delalloc_release_space(inode, data_reserved, block_start, >> - blocksize, true); >> + if (ret) { >> + if (!only_release_metadata) >> + btrfs_delalloc_release_space(inode, data_reserved, >> + block_start, blocksize, true); >> + else >> + btrfs_delalloc_release_metadata(BTRFS_I(inode), >> + blocksize, true); > > I usually find it more intuitive to have it the other way around: > > if (only_release_metadata) > ... > else > ... > > E.g., positive case first, negative in the else branch. But that's > likely too much of a personal preference. > > Thanks. > >> + } >> btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize); >> unlock_page(page); >> put_page(page); >> -- >> 2.25.0 >> > >
Attachment:
signature.asc
Description: OpenPGP digital signature
