On Fri, Feb 13, 2015 at 04:56:14PM +0000, Filipe Manana wrote:
> If we are recording in the tree log that an inode has new names (new hard
> links were added), we would drop items, belonging to the inode, that we
> shouldn't:
>
> 1) When the flag BTRFS_INODE_COPY_EVERYTHING is set in the inode's runtime
> flags, we ended up dropping all the extent and xattr items that were
> previously logged. This was done only in memory, since logging a new
> name doesn't imply syncing the log;
>
> 2) When the flag BTRFS_INODE_COPY_EVERYTHING is set in the inode's runtime
> flags, we ended up dropping all the xattr items that were previously
> logged. Like the case before, this was done only in memory because
> logging a new name doesn't imply syncing the log.
>
> This led to some surprises in scenarios such as the following:
>
> 1) write some extents to an inode;
> 2) fsync the inode;
> 3) truncate the inode or delete/modify some of its xattrs
> 4) add a new hard link for that inode
> 5) fsync some other file, to force the log tree to be durably persisted
> 6) power failure happens
>
> The next time the fs is mounted, the fsync log replay code is executed,
> and the resulting file doesn't have the content it had when the last fsync
> against it was performed, instead if has a content matching what it had
> when the last transaction commit happened.
>
> So change the behaviour such that when a new name is logged, only the inode
> item and reference items are processed.
>
> This is easy to reproduce with the test I just made for xfstests, whose
> main body is:
>
> _scratch_mkfs >> $seqres.full 2>&1
> _init_flakey
> _mount_flakey
>
> # Create our test file with some data.
> $XFS_IO_PROG -f -c "pwrite -S 0xaa -b 8K 0 8K" \
> $SCRATCH_MNT/foo | _filter_xfs_io
>
> # Make sure the file is durably persisted.
> sync
>
> # Append some data to our file, to increase its size.
> $XFS_IO_PROG -f -c "pwrite -S 0xcc -b 4K 8K 4K" \
> $SCRATCH_MNT/foo | _filter_xfs_io
>
> # Fsync the file, so from this point on if a crash/power failure happens, our
> # new data is guaranteed to be there next time the fs is mounted.
> $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
>
> # Now shrink our file to 5000 bytes.
> $XFS_IO_PROG -c "truncate 5000" $SCRATCH_MNT/foo
>
> # Now do an expanding truncate to a size larger than what we had when we last
> # fsync'ed our file. This is just to verify that after power failure and
> # replaying the fsync log, our file matches what it was when we last fsync'ed
> # it - 12Kb size, first 8Kb of data had a value of 0xaa and the last 4Kb of
> # data had a value of 0xcc.
> $XFS_IO_PROG -c "truncate 32K" $SCRATCH_MNT/foo
>
> # Add one hard link to our file. This made btrfs drop all of our file's
> # metadata from the fsync log, including the metadata relative to the
> # extent we just wrote and fsync'ed. This change was made only to the fsync
> # log in memory, so adding the hard link alone doesn't change the persisted
> # fsync log. This happened because the previous truncates set the runtime
> # flag BTRFS_INODE_NEEDS_FULL_SYNC in the btrfs inode structure.
> ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link
>
> # Now make sure the in memory fsync log is durably persisted.
> # Creating and fsync'ing another file will do it.
> # After this our persisted fsync log will no longer have metadata for our file
> # foo that points to the extent we wrote and fsync'ed before.
> touch $SCRATCH_MNT/bar
> $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar
>
> # As expected, before the crash/power failure, we should be able to see a file
> # with a size of 32Kb, with its first 5000 bytes having the value 0xaa and all
> # the remaining bytes with value 0x00.
> echo "File content before:"
> od -t x1 $SCRATCH_MNT/foo
>
> # Simulate a crash/power loss.
> _load_flakey_table $FLAKEY_DROP_WRITES
> _unmount_flakey
>
> _load_flakey_table $FLAKEY_ALLOW_WRITES
> _mount_flakey
>
> # After mounting the fs again, the fsync log was replayed.
> # The expected result is to see a file with a size of 12Kb, with its first 8Kb
> # of data having the value 0xaa and its last 4Kb of data having a value of 0xcc.
> # The btrfs bug used to leave the file as it used te be as of the last
> # transaction commit - that is, with a size of 8Kb with all bytes having a
> # value of 0xaa.
> echo "File content after:"
> od -t x1 $SCRATCH_MNT/foo
>
> The test case for xfstests follows soon.
>
> Signed-off-by: Filipe Manana <fdmanana@xxxxxxxx>
> ---
>
> V2: Added missing assignment to max_key.type for directory inode case.
>
> V3: Fixed the test/clear bit logic so that the bits are only cleared
> if we aren't logging new names.
>
> fs/btrfs/tree-log.c | 39 +++++++++++++++++++++++++++------------
> 1 file changed, 27 insertions(+), 12 deletions(-)
>
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index b0fe52a..aa4ebea 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -4069,8 +4069,10 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
> if (S_ISDIR(inode->i_mode)) {
> int max_key_type = BTRFS_DIR_LOG_INDEX_KEY;
>
> - if (inode_only == LOG_INODE_EXISTS)
> - max_key_type = BTRFS_XATTR_ITEM_KEY;
> + if (inode_only == LOG_INODE_EXISTS) {
> + max_key_type = BTRFS_INODE_EXTREF_KEY;
> + max_key.type = max_key_type;
> + }
> ret = drop_objectid_items(trans, log, path, ino, max_key_type);
> } else {
> if (inode_only == LOG_INODE_EXISTS) {
> @@ -4092,18 +4094,31 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
> if (err)
> goto out_unlock;
> }
> - if (test_and_clear_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
> - &BTRFS_I(inode)->runtime_flags)) {
> - clear_bit(BTRFS_INODE_COPY_EVERYTHING,
> - &BTRFS_I(inode)->runtime_flags);
> - ret = btrfs_truncate_inode_items(trans, log,
> - inode, 0, 0);
> - } else if (test_and_clear_bit(BTRFS_INODE_COPY_EVERYTHING,
> - &BTRFS_I(inode)->runtime_flags) ||
> + if (test_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
> + &BTRFS_I(inode)->runtime_flags)) {
> + if (inode_only == LOG_INODE_EXISTS) {
> + max_key.type = BTRFS_INODE_EXTREF_KEY;
> + ret = drop_objectid_items(trans, log, path, ino,
> + max_key.type);
BTRFS_INODE_NEEDS_FULL_SYNC is set in several cases, one of them is
'evict inode and re-read', and this patch is made for truncate,
can you make sure that there is no side effect in other cases?
Thanks,
-liubo
> + } else {
> + clear_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
> + &BTRFS_I(inode)->runtime_flags);
> + clear_bit(BTRFS_INODE_COPY_EVERYTHING,
> + &BTRFS_I(inode)->runtime_flags);
> + ret = btrfs_truncate_inode_items(trans, log,
> + inode, 0, 0);
> + }
> + } else if (test_bit(BTRFS_INODE_COPY_EVERYTHING,
> + &BTRFS_I(inode)->runtime_flags) ||
> inode_only == LOG_INODE_EXISTS) {
> - if (inode_only == LOG_INODE_ALL)
> + if (inode_only == LOG_INODE_ALL) {
> + clear_bit(BTRFS_INODE_COPY_EVERYTHING,
> + &BTRFS_I(inode)->runtime_flags);
> fast_search = true;
> - max_key.type = BTRFS_XATTR_ITEM_KEY;
> + max_key.type = BTRFS_XATTR_ITEM_KEY;
> + } else {
> + max_key.type = BTRFS_INODE_EXTREF_KEY;
> + }
> ret = drop_objectid_items(trans, log, path, ino,
> max_key.type);
> } else {
> --
> 2.1.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html