On 2018/11/6 下午10:40, Nikolay Borisov wrote:
> When a metadata read is served the endio routine btree_readpage_end_io_hook
> is called which eventually runs the tree-checker. If tree-checker fails
> to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
> leads to btree_read_extent_buffer_pages wrongly assuming that all
> available copies of this extent buffer are wrong and failing prematurely.
> Fix this modify btree_read_extent_buffer_pages to read all copies of
> the data.
>
> This failure was exhibitted in xfstests btrfs/124 which would
> spuriously fail its balance operations. The reason was that when balance
> was run following re-introduction of the missing raid1 disk
> __btrfs_map_block would map the read request to stripe 0, which
> corresponded to devid 2 (the disk which is being removed in the test):
>
> item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
> length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
> io_align 65536 io_width 65536 sector_size 4096
> num_stripes 2 sub_stripes 1
> stripe 0 devid 2 offset 2156920832
> dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
> stripe 1 devid 1 offset 3553624064
> dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca
>
> This caused read requests for a checksum item that to be routed to the
> stale disk which triggered the aforementioned logic involving
> EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
> the balance operation.
>
> Signed-off-by: Nikolay Borisov <nborisov@xxxxxxxx>
> Suggested-by: Qu Wenruo <wqu@xxxxxxxx>
Reviewed-by: Qu Wenruo <wqu@xxxxxxxx>
However there is still a tiny piece missing.
Tree checker is done after some basic checks, including:
1) bytenr
2) level
3) fsid
4) csum
1~2) can be easily skipped just by pure luck.
But 3) and especially 4) are not that easy to hit.
Not to mention meeting both 3) and 4), since csum range covers fsid.
So I must say you're a really super lucky guy!
Thanks,
Qu
> Fixes: a826d6dcb32d ("Btrfs: check items for correctness as we search")
> ---
> fs/btrfs/disk-io.c | 11 +----------
> 1 file changed, 1 insertion(+), 10 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 00ee5e37e989..279c6dbcc736 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -477,9 +477,9 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
> int mirror_num = 0;
> int failed_mirror = 0;
>
> - clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
> io_tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
> while (1) {
> + clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
> ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
> mirror_num);
> if (!ret) {
> @@ -493,15 +493,6 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
> break;
> }
>
> - /*
> - * This buffer's crc is fine, but its contents are corrupted, so
> - * there is no reason to read the other copies, they won't be
> - * any less wrong.
> - */
> - if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) ||
> - ret == -EUCLEAN)
> - break;
> -
> num_copies = btrfs_num_copies(fs_info,
> eb->start, eb->len);
> if (num_copies == 1)
>