On Mon, Mar 04, 2019 at 02:06:12PM +0000, fdmanana@xxxxxxxxxx wrote:
> From: Filipe Manana <fdmanana@xxxxxxxx>
>
> If we do a shrinking truncate against an inode which is already present
> in the respective log tree and then rename it, as part of logging the new
> name we end up logging an inode item that reflects the old size of the
> file (the one which we previously logged) and not the new smaller size.
> The decision to preserve the size previously logged was added by commit
> 1a4bcf470c886b ("Btrfs: fix fsync data loss after adding hard link to
> inode") in order to avoid data loss after replaying the log. However that
> decision is only needed for the case the logged inode size is smaller then
> the current size of the inode, as explained in that commit's change log.
> If the current size of the inode is smaller then the previously logged
> size, we know a shrinking truncate happened and therefore need to use
> that smaller size.
>
> Example to trigger the problem:
>
> $ mkfs.btrfs -f /dev/sdb
> $ mount /dev/sdb /mnt
>
> $ xfs_io -f -c "pwrite -S 0xab 0 8000" /mnt/foo
> $ xfs_io -c "fsync" /mnt/foo
> $ xfs_io -c "truncate 3000" /mnt/foo
>
> $ mv /mnt/foo /mnt/bar
> $ xfs_io -c "fsync" /mnt/bar
>
> <power failure>
>
> $ mount /dev/sdb /mnt
> $ od -t x1 -A d /mnt/bar
> 0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
> *
> 0008000
>
> Once we rename the file, we log its name (and inode item), and because
> the inode was already logged before in the current transaction, we log it
> with a size of 8000 bytes because that is the size we previously logged
> (with the first fsync). As part of the rename, besides logging the inode,
> we do also sync the log, which is done since commit d4682ba03ef618
> ("Btrfs: sync log after logging new name"), so the next fsync against our
> inode is effectively a no-op, since no new changes happened since the
> rename operation. Even if did not sync the log during the rename
> operation, the same problem (fize size of 8000 bytes instead of 3000
> bytes) would be visible after replaying the log if the log ended up
> getting synced to disk through some other means, such as for example by
> fsyncing some other modified file. In the example above the fsync after
> the rename operation is there just because not every filesystem may
> guarantee logging/journalling the inode (and syncing the log/journal)
> during the rename operation, for example it is needed for f2fs, but not
> for ext4 and xfs.
>
> Fix this scenario by, when logging a new name (which is triggered by
> rename and link operations), using the current size of the inode instead
> of the previously logged inode size.
>
> A test case for fstests follows soon.
>
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202695
> Reported-by: Seulbae Kim <seulbae@xxxxxxxxxx>
> Signed-off-by: Filipe Manana <fdmanana@xxxxxxxx>
Added to misc-next, thanks.