On 2017年11月03日 18:59, Filipe Manana wrote:
> On Thu, Nov 2, 2017 at 7:04 AM, Qu Wenruo <wqu@xxxxxxxx> wrote:
>> [BUG]
>> If we run btrfs with CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y, it will
>> instantly cause kernel panic like:
>>
>> ------
>> ...
>> assertion failed: 0, file: fs/btrfs/disk-io.c, line: 3853
>> ...
>> Call Trace:
>> btrfs_mark_buffer_dirty+0x187/0x1f0 [btrfs]
>> setup_items_for_insert+0x385/0x650 [btrfs]
>> __btrfs_drop_extents+0x129a/0x1870 [btrfs]
>> ...
>> ------
>>
>> [Cause]
>> Btrfs will call btrfs_check_leaf() in btrfs_mark_buffer_dirty() to check
>> if the leaf is valid with CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y.
>>
>> However some btrfs_mark_buffer_dirty() caller, like
>> setup_items_for_insert(), doesn't really initialize its item data but
>> only initialize its item pointers, leaving item data uninitialized.
>
> So instead of doing this juggling, the best would be to have it not call
> mark_buffer_dirty(), and leave that responsibility for the caller after
> it initializes the item data. I give you a very good reason for that below.
However setup_items_for_insert() is just one of the possible causes,
unless we overhaul all btrfs_mark_buffer_dirty() callers, it will be
whac-a-aole.
>
>>
>> This makes tree-checker catch uninitialized data as error, causing
>> such panic.
>>
>> [Fix]
>> The correct timing to check leaf validation should be before write IO or
>> after read IO.
>>
>> Just like ee have already done the tree validation check at btree
>> readpage end io hook, this patch will move the write time tree checker to
>> csum_dirty_buffer().
>>
>> As csum_dirty_buffer() is called just before submitting btree write bio, as
>> the call path shows:
>>
>> btree_submit_bio_hook()
>> |- __btree_submit_bio_start()
>> |- btree_csum_one_bio()
>> |- csum_dirty_buffer()
>> |- btrfs_check_leaf()
>>
>> By this we can ensure the leaf passed in is in consistent status, and
>> can check them without causing tons of false alert.
>>
>> Reported-by: Lakshmipathi.G <lakshmipathi.g@xxxxxxxxx>
>> Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
>> ---
>> fs/btrfs/disk-io.c | 26 +++++++++++++++++++-------
>> 1 file changed, 19 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index efce9a2fa9be..6c17bce2a05e 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -506,6 +506,7 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)
>> u64 start = page_offset(page);
>> u64 found_start;
>> struct extent_buffer *eb;
>> + int ret;
>>
>> eb = (struct extent_buffer *)page->private;
>> if (page != eb->pages[0])
>> @@ -524,7 +525,24 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)
>> ASSERT(memcmp_extent_buffer(eb, fs_info->fsid,
>> btrfs_header_fsid(), BTRFS_FSID_SIZE) == 0);
>>
>> - return csum_tree_block(fs_info, eb, 0);
>> + ret = csum_tree_block(fs_info, eb, 0);
>> + if (ret)
>> + return ret;
>> +
>> +#ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
>> + /*
>> + * Do extra check before we write the tree block into disk.
>> + */
>> + if (btrfs_header_level(eb) == 0) {
>> + ret = btrfs_check_leaf(fs_info->tree_root, eb);
>> + if (ret) {
>> + btrfs_print_leaf(eb);
>> + ASSERT(0);
>> + return ret;
>> + }
>> + }
>> +#endif
>> + return 0;
>> }
>>
>> static int check_tree_block_fsid(struct btrfs_fs_info *fs_info,
>> @@ -3847,12 +3865,6 @@ void btrfs_mark_buffer_dirty(struct extent_buffer *buf)
>> percpu_counter_add_batch(&fs_info->dirty_metadata_bytes,
>> buf->len,
>> fs_info->dirty_metadata_batch);
>> -#ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
>> - if (btrfs_header_level(buf) == 0 && btrfs_check_leaf(root, buf)) {
>> - btrfs_print_leaf(buf);
>> - ASSERT(0);
>> - }
>> -#endif
>
> So there's a reason why btrfs_check_leaf() was called here, at
> mark_buffer_dirty(),
> instead of somewhere else like csum_dirty_buffer().
>
> The reason is that once some bad code inserts a key out of order for
> example (or did any
> other bad stuff that check_leaf() catched before you added the
> tree-checker thing), we
> would get a trace that pinpoints exactly where the bad code is. With
> this change, we will
> only know some is bad when writeback of the leaf starts, and before
> that happens, the leaf might
> have been changed dozens of times by many different functions (and
> this happens very
> often, it's far from being a unusual case), in which case the given
> trace won't tell you which code
> misbehaved. This makes it harder to find out bugs, and as it used to
> be it certainly helped me in
> the past several times. IOW, I would prefer what I mentioned earlier
> or, at very least, do those new
> checks that validate data only at writeback start time.
OK, I'll skip any item member check in btrfs_mark_buffer_dirty(), to
make it still give early warning, and do the full comprehensive check
only before writeback.
Thanks,
Qu
>
>> }
>>
>> static void __btrfs_btree_balance_dirty(struct btrfs_fs_info *fs_info,
>> --
>> 2.14.3
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
>
Attachment:
signature.asc
Description: OpenPGP digital signature
