On Fri, Jan 31, 2020 at 02:06:07PM +0000, fdmanana@xxxxxxxxxx wrote:
> From: Filipe Manana <fdmanana@xxxxxxxx>
>
> We have a few cases where we allow an extent map that is in an extent map
> tree to be merged with other extents in the tree. Such cases include the
> unpinning of an extent after the respective ordered extent completed or
> after logging an extent during a fast fsync. This can lead to subtle and
> dangerous problems because when doing the merge some other task might be
> using the same extent map and as consequence see an inconsistent state of
> the extent map - for example sees the new length but has seen the old start
> offset.
>
> With luck this triggers a BUG_ON(), and not some silent bug, such as the
> following one in __do_readpage():
>
> $ cat -n fs/btrfs/extent_io.c
> 3061 static int __do_readpage(struct extent_io_tree *tree,
> 3062 struct page *page,
> (...)
> 3127 em = __get_extent_map(inode, page, pg_offset, cur,
> 3128 end - cur + 1, get_extent, em_cached);
> 3129 if (IS_ERR_OR_NULL(em)) {
> 3130 SetPageError(page);
> 3131 unlock_extent(tree, cur, end);
> 3132 break;
> 3133 }
> 3134 extent_offset = cur - em->start;
> 3135 BUG_ON(extent_map_end(em) <= cur);
> (...)
>
> Consider the following example scenario, where we end up hitting the
> BUG_ON() in __do_readpage().
>
> We have an inode with a size of 8Kb and 2 extent maps:
>
> extent A: file offset 0, length 4Kb, disk_bytenr = X, persisted on disk by
> a previous transaction
>
> extent B: file offset 4Kb, length 4Kb, disk_bytenr = X + 4Kb, not yet
> persisted but writeback started for it already. The extent map
> is pinned since there's writeback and an ordered extent in
> progress, so it can not be merged with extent map A yet
>
> The following sequence of steps leads to the BUG_ON():
>
> 1) The ordered extent for extent B completes, the respective page gets its
> writeback bit cleared and the extent map is unpinned, at that point it
> is not yet merged with extent map A because it's in the list of modified
> extents;
>
> 2) Due to memory pressure, or some other reason, the mm subsystem releases
> the page corresponding to extent B - btrfs_releasepage() is called and
> returns 1, meaning the page can be released as it's not dirty, not under
> writeback anymore and the extent range is not locked in the inode's
> iotree. However the extent map is not released, either because we are
> not in a context that allows memory allocations to block or because the
> inode's size is smaller than 16Mb - in this case our inode has a size
> of 8Kb;
>
> 3) Task B needs to read extent B and ends up __do_readpage() through the
> btrfs_readpage() callback. At __do_readpage() it gets a reference to
> extent map B;
>
> 4) Task A, doing a fast fsync, calls clear_em_loggin() against extent map B
> while holding the write lock on the inode's extent map tree - this
> results in try_merge_map() being called and since it's possible to merge
> extent map B with extent map A now (the extent map B was removed from
> the list of modified extents), the merging begins - it sets extent map
> B's start offset to 0 (was 4Kb), but before it increments the map's
> length to 8Kb (4kb + 4Kb), task A is at:
>
> BUG_ON(extent_map_end(em) <= cur);
>
> The call to extent_map_end() sees the extent map has a start of 0
> and a length still at 4Kb, so it returns 4Kb and 'cur' is 4Kb, so
> the BUG_ON() is triggered.
>
> So it's dangerous to modify an extent map that is in the tree, because some
> other task might have got a reference to it before and still using it, and
> needs to see a consistent map while using it. Generally this is very rare
> since most paths that lookup and use extent maps also have the file range
> locked in the inode's iotree. The fsync path is pretty much the only
> exception where we don't do it to avoid serialization with concurrent
> reads.
>
> Fix this by not allowing an extent map do be merged if if it's being used
> by tasks other then the one attempting to merge the extent map (when the
> reference count of the extent map is greater than 2).
>
> Reported-by: ryusuke1925 <st13s20@xxxxxxxxxxxxxxxxxxx>
> Reported-by: Koki Mitani <koki.mitani.xg@xxxxxxxxxxxxx>
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206211
> CC: stable@xxxxxxxxxxxxxxx # 4.4+
> Signed-off-by: Filipe Manana <fdmanana@xxxxxxxx>
Added to misc-next, thanks.