Unreliable btrfs_cross_ref_exist() check for self cloned inode due to lack of sub-extent level check

Hi,

[BUG]
While digging into the nodatacow miss cases, I found that the following
operations can lead to unexpected CoW.

  mkfs.btrfs -f $dev -b 1G > /dev/null
  mount $dev $mnt -o nospace_cache

  xfs_io -f -c "falloc 8k 24k" -c "pwrite 12k 8k" $mnt/file1
  xfs_io -c "reflink $mnt/file1 8k 0 4k" $mnt/file1
  umount $dev

The result is that the [12k, 20k) range gets CoWed:

	item 7 key (257 EXTENT_DATA 4096) itemoff 15760 itemsize 53
		generation 6 type 2 (prealloc)
		prealloc data disk byte 13631488 nr 28672
	item 8 key (257 EXTENT_DATA 12288) itemoff 15707 itemsize 53
		generation 6 type 1 (regular)
		extent data disk byte 13660160 nr 12288 <<< not 13631488
	item 9 key (257 EXTENT_DATA 24576) itemoff 15654 itemsize 53
		generation 6 type 2 (prealloc)
		prealloc data disk byte 13631488 nr 28672

[Why this matters]
Some may dismiss this problem and say I'm overreacting: as long as the
data is CoWed, there is no data loss.

But I would argue that:
- This breaks the fallocate behavior
  Without snapshots or shared extents, we should always be able to write
  into preallocated extents.
  But the write gets CoWed, which means we are forced to allocate a new
  extent. This can even fail at delalloc time and lead to a transaction
  abort.

- This behavior only happens after that self-clone.
  If we remove that reflink call, everything goes as expected:

    xfs_io -f -c "falloc 8k 24k" -c "pwrite 12k 8k" $mnt/file1
    umount $dev

  Then we get:
	item 7 key (257 EXTENT_DATA 8192) itemoff 15760 itemsize 53
		generation 6 type 2 (prealloc)
		prealloc data disk byte 13631488 nr 24576
	item 8 key (257 EXTENT_DATA 12288) itemoff 15707 itemsize 53
		generation 6 type 1 (regular)
		extent data disk byte 13631488 nr 24576
	item 9 key (257 EXTENT_DATA 20480) itemoff 15654 itemsize 53
		generation 6 type 2 (prealloc)
		prealloc data disk byte 13631488 nr 24576


[Cause]
It's directly caused by check_delayed_ref().

At delalloc time, although we go into run_delalloc_nocow(), it still
calls btrfs_cross_ref_exist() to verify we're not writing into a
shared/hole extent.

Then it calls check_delayed_ref() for extent 13631488.

Due to the last reflink, we added one ref to extent 13631488.
The backref offset of that data ref is
file_offset (0) - extent_offset (0) = 0.

So in the delayed refs, 13631488 has two different refs: one with offset
8192 (for file offsets 20K and 8K), and one with offset 0 (the reflinked
one at file offset 0).

Now check_delayed_ref() verifies that all the references to that data
extent have the same backref offset.
But we have one ref with backref offset 0, not the 8K we expect.

So check_delayed_ref() thinks we're writing into a shared extent and
falls back to CoW.
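
To make the offset arithmetic above concrete, here is a toy model in
Python. It is not kernel code: FileExtent, backref_offset() and
looks_shared() are names made up for illustration, and the model only
mimics the "all refs must have the same backref offset" rule described
above, ignoring root and inode number.

```python
from dataclasses import dataclass

K = 1024


@dataclass(frozen=True)
class FileExtent:
    """Hypothetical stand-in for one file extent item of the inode."""
    file_offset: int    # where the item starts in the file
    extent_offset: int  # offset into the on-disk data extent


def backref_offset(fe: FileExtent) -> int:
    # Backref offset as described above: file_offset - extent_offset.
    return fe.file_offset - fe.extent_offset


def looks_shared(refs, writer: FileExtent) -> bool:
    # Toy version of the rule: any ref whose backref offset differs from
    # the writer's makes the whole extent look shared.
    expected = backref_offset(writer)
    return any(backref_offset(r) != expected for r in refs)


# falloc 8k 24k: one prealloc item at file offset 8K, extent offset 0.
prealloc = FileExtent(file_offset=8 * K, extent_offset=0)
# reflink file1 8k -> 0, 4k: a second item pointing at the same extent.
reflinked = FileExtent(file_offset=0, extent_offset=0)

without_clone = looks_shared([prealloc], prealloc)
with_clone = looks_shared([prealloc, reflinked], prealloc)
print(without_clone, with_clone)
```

In this model the extent only looks shared once the self-clone adds the
ref with backref offset 0, even though the cloned 4K does not cover the
range being written.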

[Fix?]
For this problem, we need a sub-extent level shared check.
The current delayed refs together with the on-disk extent tree can't
provide such a facility.
Any ideas on this problem are welcome.
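
As a rough illustration of what a sub-extent level check would have to
decide, here is a Python sketch that compares byte ranges inside the
data extent instead of backref offsets. This is purely hypothetical:
ranges_overlap() is a made-up helper, and the per-ref ranges it relies
on are exactly the information that the delayed refs and the extent
tree do not provide today.

```python
K = 1024


def ranges_overlap(a_start: int, a_len: int, b_start: int, b_len: int) -> bool:
    # Half-open ranges [start, start + len) overlap iff each one starts
    # before the other one ends.
    return a_start < b_start + b_len and b_start < a_start + a_len


# Ranges are offsets inside data extent 13631488 (assumed for this
# example from the reproducer above).
# The write to file range [12k, 20k) minus backref offset 8k maps to
# extent range [4k, 12k).
write_range = (4 * K, 8 * K)
# The self-clone "reflink 8k 0 4k" references extent range [0, 4k).
clone_range = (0, 4 * K)

shared_at_sub_extent_level = ranges_overlap(*write_range, *clone_range)
print(shared_at_sub_extent_level)
```

With range information like this, the write in the reproducer would not
count as going into a shared range, so it could stay NOCOW.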

Or is nodatacow always a second-class citizen in the btrfs world?

Thanks,
Qu
