On Thu, Jun 25, 2015 at 3:20 PM, Liu Bo <bo.li.liu@xxxxxxxxxx> wrote:
> On Thu, Jun 25, 2015 at 04:17:46AM +0100, fdmanana@xxxxxxxxxx wrote:
>> From: Filipe Manana <fdmanana@xxxxxxxx>
>>
>> When we have the no_holes feature enabled, if a we truncate a file to a
>> smaller size, truncate it again but to a size greater than or equals to
>> its original size and fsync it, the log tree will not have any information
>> about the hole covering the range [truncate_1_offset, new_file_size[.
>> Which means if the fsync log is replayed, the file will remain with the
>> state it had before both truncate operations.
>
> Does the fs/subvol tree get updated to the right information at this
> time?
No, and that's the problem. Because no file extent items are stored in
the log tree.
The inode item is updated with the new i_size however (as expected).
thanks
>
> Thanks,
>
> -liubo
>>
>> Without the no_holes feature this does not happen, since when the inode
>> is logged (full sync flag is set) it will find in the fs/subvol tree a
>> leaf with a generation matching the current transaction id that has an
>> explicit extent item representing the hole.
>>
>> Fix this by adding an explicit extent item representing a hole between
>> the last extent and the inode's i_size if we are doing a full sync.
>>
>> The issue is easy to reproduce with the following test case for fstests:
>>
>> . ./common/rc
>> . ./common/filter
>> . ./common/dmflakey
>>
>> _need_to_be_root
>> _supported_fs generic
>> _supported_os Linux
>> _require_scratch
>> _require_dm_flakey
>>
>> # This test was motivated by an issue found in btrfs when the btrfs
>> # no-holes feature is enabled (introduced in kernel 3.14). So enable
>> # the feature if the fs being tested is btrfs.
>> if [ $FSTYP == "btrfs" ]; then
>> _require_btrfs_fs_feature "no_holes"
>> _require_btrfs_mkfs_feature "no-holes"
>> MKFS_OPTIONS="$MKFS_OPTIONS -O no-holes"
>> fi
>>
>> rm -f $seqres.full
>>
>> _scratch_mkfs >>$seqres.full 2>&1
>> _init_flakey
>> _mount_flakey
>>
>> # Create our test files and make sure everything is durably persisted.
>> $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 64K" \
>> -c "pwrite -S 0xbb 64K 61K" \
>> $SCRATCH_MNT/foo | _filter_xfs_io
>> $XFS_IO_PROG -f -c "pwrite -S 0xee 0 64K" \
>> -c "pwrite -S 0xff 64K 61K" \
>> $SCRATCH_MNT/bar | _filter_xfs_io
>> sync
>>
>> # Now truncate our file foo to a smaller size (64Kb) and then truncate
>> # it to the size it had before the shrinking truncate (125Kb). Then
>> # fsync our file. If a power failure happens after the fsync, we expect
>> # our file to have a size of 125Kb, with the first 64Kb of data having
>> # the value 0xaa and the second 61Kb of data having the value 0x00.
>> $XFS_IO_PROG -c "truncate 64K" \
>> -c "truncate 125K" \
>> -c "fsync" \
>> $SCRATCH_MNT/foo
>>
>> # Do something similar to our file bar, but the first truncation sets
>> # the file size to 0 and the second truncation expands the size to the
>> # double of what it was initially.
>> $XFS_IO_PROG -c "truncate 0" \
>> -c "truncate 253K" \
>> -c "fsync" \
>> $SCRATCH_MNT/bar
>>
>> _load_flakey_table $FLAKEY_DROP_WRITES
>> _unmount_flakey
>>
>> # Allow writes again, mount to trigger log replay and validate file
>> # contents.
>> _load_flakey_table $FLAKEY_ALLOW_WRITES
>> _mount_flakey
>>
>> # We expect foo to have a size of 125Kb, the first 64Kb of data all
>> # having the value 0xaa and the remaining 61Kb to be a hole (all bytes
>> # with value 0x00).
>> echo "File foo content after log replay:"
>> od -t x1 $SCRATCH_MNT/foo
>>
>> # We expect bar to have a size of 253Kb and no extents (any byte read
>> # from bar has the value 0x00).
>> echo "File bar content after log replay:"
>> od -t x1 $SCRATCH_MNT/bar
>>
>> status=0
>> exit
>>
>> The expected file contents in the golden output are:
>>
>> File foo content after log replay:
>> 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>> *
>> 0200000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> *
>> 0372000
>> File bar content after log replay:
>> 0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> *
>> 0772000
>>
>> Without this fix, their contents are:
>>
>> File foo content after log replay:
>> 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>> *
>> 0200000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>> *
>> 0372000
>> File bar content after log replay:
>> 0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
>> *
>> 0200000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>> *
>> 0372000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> *
>> 0772000
>>
>> A test case submission for fstests follows soon.
>>
>> Signed-off-by: Filipe Manana <fdmanana@xxxxxxxx>
>> ---
>> fs/btrfs/tree-log.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 108 insertions(+)
>>
>> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
>> index 7ac45cf..ac90336 100644
>> --- a/fs/btrfs/tree-log.c
>> +++ b/fs/btrfs/tree-log.c
>> @@ -4203,6 +4203,107 @@ static int btrfs_log_all_xattrs(struct btrfs_trans_handle *trans,
>> return 0;
>> }
>>
>> +/*
>> + * If the no holes feature is enabled we need to make sure any hole between the
>> + * last extent and the i_size of our inode is explicitly marked in the log. This
>> + * is to make sure that doing something like:
>> + *
>> + * 1) create file with 128Kb of data
>> + * 2) truncate file to 64Kb
>> + * 3) truncate file to 256Kb
>> + * 4) fsync file
>> + * 5) <crash/power failure>
>> + * 6) mount fs and trigger log replay
>> + *
>> + * Will give us a file with a size of 256Kb, the first 64Kb of data match what
>> + * the file had in its first 64Kb of data at step 1 and the last 192Kb of the
>> + * file correspond to a hole. The presence of explicit holes in a log tree is
>> + * what guarantees that log replay will remove/adjust file extent items in the
>> + * fs/subvol tree.
>> + *
>> + * Here we do not need to care about holes between extents, that is already done
>> + * by copy_items(). We also only need to do this in the full sync path, where we
>> + * lookup for extents from the fs/subvol tree only. In the fast path case, we
>> + * lookup the list of modified extent maps and if any represents a hole, we
>> + * insert a corresponding extent representing a hole in the log tree.
>> + */
>> +static int btrfs_log_trailing_hole(struct btrfs_trans_handle *trans,
>> + struct btrfs_root *root,
>> + struct inode *inode,
>> + struct btrfs_path *path)
>> +{
>> + int ret;
>> + struct btrfs_key key;
>> + u64 hole_start;
>> + u64 hole_size;
>> + struct extent_buffer *leaf;
>> + struct btrfs_root *log = root->log_root;
>> + const u64 ino = btrfs_ino(inode);
>> + const u64 i_size = i_size_read(inode);
>> +
>> + if (!btrfs_fs_incompat(root->fs_info, NO_HOLES))
>> + return 0;
>> +
>> + key.objectid = ino;
>> + key.type = BTRFS_EXTENT_DATA_KEY;
>> + key.offset = (u64)-1;
>> +
>> + ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
>> + ASSERT(ret != 0);
>> + if (ret < 0)
>> + return ret;
>> +
>> + ASSERT(path->slots[0] > 0);
>> + path->slots[0]--;
>> + leaf = path->nodes[0];
>> + btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
>> +
>> + if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
>> + /* inode does not have any extents */
>> + hole_start = 0;
>> + hole_size = i_size;
>> + } else {
>> + struct btrfs_file_extent_item *extent;
>> + u64 len;
>> +
>> + /*
>> + * If there's an extent beyond i_size, an explicit hole was
>> + * already inserted by copy_items().
>> + */
>> + if (key.offset >= i_size)
>> + return 0;
>> +
>> + extent = btrfs_item_ptr(leaf, path->slots[0],
>> + struct btrfs_file_extent_item);
>> +
>> + if (btrfs_file_extent_type(leaf, extent) ==
>> + BTRFS_FILE_EXTENT_INLINE) {
>> + len = btrfs_file_extent_inline_len(leaf,
>> + path->slots[0],
>> + extent);
>> + ASSERT(len == i_size);
>> + return 0;
>> + }
>> +
>> + len = btrfs_file_extent_num_bytes(leaf, extent);
>> + /* Last extent goes beyond i_size, no need to log a hole. */
>> + if (key.offset + len > i_size)
>> + return 0;
>> + hole_start = key.offset + len;
>> + hole_size = i_size - hole_start;
>> + }
>> + btrfs_release_path(path);
>> +
>> + /* Last extent ends at i_size. */
>> + if (hole_size == 0)
>> + return 0;
>> +
>> + hole_size = ALIGN(hole_size, root->sectorsize);
>> + ret = btrfs_insert_file_extent(trans, log, ino, hole_start, 0, 0,
>> + hole_size, 0, hole_size, 0, 0, 0);
>> + return ret;
>> +}
>> +
>> /* log a single inode in the tree log.
>> * At least one parent directory for this inode must exist in the tree
>> * or be logged already.
>> @@ -4466,6 +4567,13 @@ next_slot:
>> err = btrfs_log_all_xattrs(trans, root, inode, path, dst_path);
>> if (err)
>> goto out_unlock;
>> + if (max_key.type >= BTRFS_EXTENT_DATA_KEY && !fast_search) {
>> + btrfs_release_path(path);
>> + btrfs_release_path(dst_path);
>> + err = btrfs_log_trailing_hole(trans, root, inode, path);
>> + if (err)
>> + goto out_unlock;
>> + }
>> log_extents:
>> btrfs_release_path(path);
>> btrfs_release_path(dst_path);
>> --
>> 2.1.3
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html