On Fri, Jun 07, 2019 at 11:25:24AM +0100, fdmanana@xxxxxxxxxx wrote:
> From: Filipe Manana <fdmanana@xxxxxxxx>
>
> When we log an inode, regardless of logging it completely or only that it
> exists, we always update it as logged (logged_trans and last_log_commit
> fields of the inode are updated). This is generally fine and avoids future
> attempts to log it from having to do repeated work that brings no value.
>
> However, if we write data to a file, then evict its inode after all the
> dealloc was flushed (and ordered extents completed), rename the file and
> fsync it, we end up not logging the new extents, since the rename may
> result in logging that the inode exists in case the parent directory was
> logged before. The following reproducer shows and explains how this can
> happen:
>
> $ mkfs.btrfs -f /dev/sdb
> $ mount /dev/sdb /mnt
>
> $ mkdir /mnt/dir
> $ touch /mnt/dir/foo
> $ touch /mnt/dir/bar
>
> # Do a direct IO write instead of a buffered write because with a
> # buffered write we would need to make sure dealloc gets flushed and
> # complete before we do the inode eviction later, and we can not do that
> # from user space with call to things such as sync(2) since that results
> # in a transaction commit as well.
> $ xfs_io -d -c "pwrite -S 0xd3 0 4K" /mnt/dir/bar
>
> # Keep the directory dir in use while we evict inodes. We want our file
> # bar's inode to be evicted but we don't want our directory's inode to
> # be evicted (if it were evicted too, we would not be able to reproduce
> # the issue since the first fsync below, of file foo, would result in a
> # transaction commit.
> $ ( cd /mnt/dir; while true; do :; done ) &
> $ pid=$!
>
> # Wait a bit to give time for the background process to chdir.
> $ sleep 0.1
>
> # Evict all inodes, except the inode for the directory dir because it is
> # currently in use by our background process.
> $ echo 2 > /proc/sys/vm/drop_caches
>
> # fsync file foo, which ends up persisting information about the parent
> # directory because it is a new inode.
> $ xfs_io -c fsync /mnt/dir/foo
>
> # Rename bar, this results in logging that this inode exists (inode item,
> # names, xattrs) because the parent directory is in the log.
> $ mv /mnt/dir/bar /mnt/dir/baz
>
> # Now fsync baz, which ends up doing absolutely nothing because of the
> # rename operation which logged that the inode exists only.
> $ xfs_io -c fsync /mnt/dir/baz
>
> <power failure>
>
> $ mount /dev/sdb /mnt
> $ od -t x1 -A d /mnt/dir/baz
> 0000000
>
> --> Empty file, data we wrote is missing.
>
> Fix this by not updating last_sub_trans of an inode when we are logging
> only that it exists and the inode was not yet logged since it was loaded
> from disk (full_sync bit set), this is enough to make btrfs_inode_in_log()
> return false for this scenario and make us log the inode. The logged_trans
> of the inode is still always setsince that alone is used to track if names
> need to be deleted as part of unlink operations.
>
> Fixes: 257c62e1bce03e ("Btrfs: avoid tree log commit when there are no changes")
> Signed-off-by: Filipe Manana <fdmanana@xxxxxxxx>
Added to misc-next, thanks.