On Thu, Jul 12, 2018 at 01:36:43AM +0100, fdmanana@xxxxxxxxxx wrote:
> From: Filipe Manana <fdmanana@xxxxxxxx>
>
> When we clone a range into a file we can end up dropping existing
> extent maps (or trimming them) and replacing them with new ones if the
> range to be cloned overlaps with a range in the destination inode.
> When that happens we add the new extent maps to the list of modified
> extents in the inode's extent map tree, so that a "fast" fsync (the flag
> BTRFS_INODE_NEEDS_FULL_SYNC not set in the inode) will see the extent maps
> and log corresponding extent items. However, at the end of range cloning
> operation we do truncate all the pages in the affected range (in order to
> ensure future reads will not get stale data). Sometimes this truncation
> will release the corresponding extent maps besides the pages from the page
> cache. If this happens, then a "fast" fsync operation will miss logging
> some extent items, because it relies exclusively on the extent maps being
> present in the inode's extent tree, leading to data loss/corruption if
> the fsync ends up using the same transaction used by the clone operation
> (that transaction was not committed in the meanwhile). An extent map is
> released through the callback btrfs_invalidatepage(), which gets called by
> truncate_inode_pages_range(), and it calls __btrfs_releasepage(). The
> later ends up calling try_release_extent_mapping() which will release the
> extent map if some conditions are met, like the file size being greater
> than 16Mb, gfp flags allow blocking and the range not being locked (which
> is the case during the clone operation) nor being the extent map flagged
> as pinned (also the case for cloning).
>
> The following example, turned into a test for fstests, reproduces the
> issue:
>
> $ mkfs.btrfs -f /dev/sdb
> $ mount /dev/sdb /mnt
>
> $ xfs_io -f -c "pwrite -S 0x18 9000K 6908K" /mnt/foo
> $ xfs_io -f -c "pwrite -S 0x20 2572K 156K" /mnt/bar
>
> $ xfs_io -c "fsync" /mnt/bar
> # reflink destination offset corresponds to the size of file bar,
> # 2728Kb minus 4Kb.
> $ xfs_io -c ""reflink ${SCRATCH_MNT}/foo 0 2724K 15908K" /mnt/bar
> $ xfs_io -c "fsync" /mnt/bar
>
> $ md5sum /mnt/bar
> 95a95813a8c2abc9aa75a6c2914a077e /mnt/bar
>
> <power fail>
>
> $ mount /dev/sdb /mnt
> $ md5sum /mnt/bar
> 207fd8d0b161be8a84b945f0df8d5f8d /mnt/bar
> # digest should be 95a95813a8c2abc9aa75a6c2914a077e like before the
> # power failure
>
> In the above example, the destination offset of the clone operation
> corresponds to the size of the "bar" file minus 4Kb. So during the clone
> operation, the extent map covering the range from 2572Kb to 2728Kb gets
> trimmed so that it ends at offset 2724Kb, and a new extent map covering
> the range from 2724Kb to 11724Kb is created. So at the end of the clone
> operation when we ask to truncate the pages in the range from 2724Kb to
> 2724Kb + 15908Kb, the page invalidation callback ends up removing the new
> extent map (through try_release_extent_mapping()) when the page at offset
> 2724Kb is passed to that callback.
>
> Fix this by setting the bit BTRFS_INODE_NEEDS_FULL_SYNC whenever an extent
> map is removed at try_release_extent_mapping(), forcing the next fsync to
> search for modified extents in the fs/subvolume tree instead of relying on
> the presence of extent maps in memory. This way we can continue doing a
> "fast" fsync if the destination range of a clone operation does not
> overlap with an existing range or if any of the criteria necessary to
> remove an extent map at try_release_extent_mapping() is not met (file
> size not bigger then 16Mb or gfp flags do not allow blocking).
>
> CC: stable@xxxxxxxxxxxxxxx # 3.16+
> Signed-off-by: Filipe Manana <fdmanana@xxxxxxxx>
I've added this to misc-next and will forward it to 4.18 as the type of
fix qualifies for a late rc state, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html