On Mon, Oct 29, 2018 at 09:42:06AM +0000, fdmanana@xxxxxxxxxx wrote:
> From: Filipe Manana <fdmanana@xxxxxxxx>
>
> Recently we got a massive simplification for fsync, where for the fast
> path we no longer log new extents while their respective ordered extents
> are still running.
>
> However that simplification introduced a subtle regression for the case
> where we use a ranged fsync (msync). Consider the following example:
>
> CPU 0 CPU 1
>
> mmap write to range [2Mb, 4Mb[
> mmap write to range [512Kb, 1Mb[
> msync range [512K, 1Mb[
> --> triggers fast fsync
> (BTRFS_INODE_NEEDS_FULL_SYNC
> not set)
> --> creates extent map A for this
> range and adds it to list of
> modified extents
> --> starts ordered extent A for
> this range
> --> waits for it to complete
>
> writeback triggered for range
> [2Mb, 4Mb[
> --> create extent map B and
> adds it to the list of
> modified extents
> --> creates ordered extent B
>
> --> start looking for and logging
> modified extents
> --> logs extent maps A and B
> --> finds checksums for extent A
> in the csum tree, but not for
> extent B
> fsync (msync) finishes
>
> --> ordered extent B
> finishes and its
> checksums are added
> to the csum tree
>
> <power cut>
>
> After replaying the log, we have the extent covering the range [2Mb, 4Mb[
> but do not have the data checksum items covering that file range.
>
> This happens because at the very beginning of an fsync (btrfs_sync_file())
> we start and wait for IO in the given range [512Kb, 1Mb[ and therefore
> wait for any ordered extents in that range to complete before we start
> logging the extents. However if right before we start logging the extent
> in our range [512Kb, 1Mb[, writeback is started for any other dirty range,
> such as the range [2Mb, 4Mb[ due to memory pressure or a concurrent fsync
> or msync (btrfs_sync_file() starts writeback before acquiring the inode's
> lock), an ordered extent is created for that other range and a new extent
> map is created to represent that range and added to the inode's list of
> modified extents.
>
> That means that we will see that other extent in that list when collecting
> extents for logging (done at btrfs_log_changed_extents()) and log the
> extent before the respective ordered extent finishes - namely before the
> checksum items are added to the checksums tree, which is where
> log_extent_csums() looks for the checksums, therefore making us log an
> extent without logging its checksums. Before that massive simplification
> of fsync, this wasn't a problem because besides looking for checkums in
> the checksums tree, we also looked for them in any ordered extent still
> running.
>
> The consequence of data checksums missing for a file range is that users
> attempting to read the affected file range will get -EIO errors and dmesg
> reports the following:
>
> [10188.358136] BTRFS info (device sdc): no csum found for inode 297 start 57344
> [10188.359278] BTRFS warning (device sdc): csum failed root 5 ino 297 off 57344 csum 0x98f94189 expected csum 0x00000000 mirror 1
>
> So fix this by skipping an extents outside of our logging range at
> btrfs_log_changed_extents() and leaving them on the list of modified
> extents so that any subsequent ranged fsync may collect them if needed.
> Also, if we find a hole extent outside of the range still log it, just
> to prevent having gaps between extent items after replaying the log,
> otherwise fsck will complain when we are not using the NO_HOLES feature
> (fstest btrfs/056 triggers such case).
>
> Fixes: e7175a692765 ("btrfs: remove the wait ordered logic in the log_one_extent path")
> CC: stable@xxxxxxxxxxxxxxx # 4.19+
> Signed-off-by: Filipe Manana <fdmanana@xxxxxxxx>
Nice catch,
Reviewed-by: Josef Bacik <josef@xxxxxxxxxxxxxx>
Josef