Re: [PATCH v2] btrfs: Fix error handling in btrfs_cleanup_ordered_extents

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 26.10.2018 14:53, Qu Wenruo wrote:
> 
> 
> On 2018/10/26 下午7:41, Nikolay Borisov wrote:
>> Running btrfs/124 in a loop hung up on me sporadically with the
>> following call trace:
>> 	btrfs           D    0  5760   5324 0x00000000
>> 	Call Trace:
>> 	 ? __schedule+0x243/0x800
>> 	 schedule+0x33/0x90
>> 	 btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs]
>> 	 ? wait_woken+0xa0/0xa0
>> 	 btrfs_wait_ordered_range+0xbb/0x100 [btrfs]
>> 	 btrfs_relocate_block_group+0x1ff/0x230 [btrfs]
>> 	 btrfs_relocate_chunk+0x49/0x100 [btrfs]
>> 	 btrfs_balance+0xbeb/0x1740 [btrfs]
>> 	 btrfs_ioctl_balance+0x2ee/0x380 [btrfs]
>> 	 btrfs_ioctl+0x1691/0x3110 [btrfs]
>> 	 ? lockdep_hardirqs_on+0xed/0x180
>> 	 ? __handle_mm_fault+0x8e7/0xfb0
>> 	 ? _raw_spin_unlock+0x24/0x30
>> 	 ? __handle_mm_fault+0x8e7/0xfb0
>> 	 ? do_vfs_ioctl+0xa5/0x6e0
>> 	 ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
>> 	 do_vfs_ioctl+0xa5/0x6e0
>> 	 ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
>> 	 ksys_ioctl+0x3a/0x70
>> 	 __x64_sys_ioctl+0x16/0x20
>> 	 do_syscall_64+0x60/0x1b0
>> 	 entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>
>> Turns out during page writeback it's possible that the code in
>> writepage_delalloc can instantiate a delalloc range which doesn't
>> belong to the page currently being written back.
> 
> Just a nitpick, would you please split long paragraphs with newlines?
> 
> It makes reviewers' eyes less painful.
> 
>> This happens since
>> find_lock_delalloc_range returns up to BTRFS_MAX_EXTENT_SIZE delalloc
>> range when asked and doens't really consider the range of the passed
>> page.
> 
> 
>> When such a foregin range is found the code proceeds to
>> run_delalloc_range and calls the appropriate function to fill the
>> delalloc and create ordered extents. If, however, a failure occurs
>> while this operation is in effect then the clean up code in
>> btrfs_cleanup_ordered_extents will be called. This function has the
>> wrong assumption that caller of run_delalloc_range always properly
>> cleans the first page of the range hence when it calls
>> __endio_write_update_ordered it explicitly ommits the first page of
>> the delalloc range.
> 
> Yes, that's the old assumption, at least well explained by some ascii
> art. (even it's wrong)
> 
>> This is wrong because this function could be
>> cleaning a delalloc range that doesn't belong to the current page. This
>> in turn means that the page cleanup code in __extent_writepage will
>> not really free the initial page for the range, leaving a hanging
>> ordered extent with bytes_left set to 4k. This bug has been present
>> ever since the original introduction of the cleanup code in 524272607e88.
> 
> The cause sounds valid, however would you please explain more about how
> such cleaning on unrelated delalloc range happens?

So in my case the following happened - 2 block groups were created as
delalloc ranges in the - 0-1m and 1m-128m. Their respective pages were
dirtied, so when page 0 - 0-4k when into writepage_delalloc,
find_lock_delalloc_range would return the range 0-1m. So the call to
fill_delalloc instantiates OE 0-1m and writeback continues as expected.

Now, when the 2nd page from range 0-1m is again set for writeback and
find_lock_delalloc_range is called with delalloc_start ==  4096 it will
actually return the range 1m-128m. Then fill_delalloc is called with
locked_page belonging to the range which was already instantiated and
the start/end arguments pointing to 1m-128m if an error occurred in
run_delalloc_range in this case then  btrfs_cleanup_ordered_extents will
be called which will clear Private2 bit for pages belonging to 1m-128m
range and the OE will be cleared of all but the first page (since the
code wrongly assumed locked_page always belongs to the range currently
being instantiated).

> 
> Even the fix looks solid, it's still better to explain the cause a
> little more, as the more we understand the cause, the better solution we
> may get.
> 
>>
>> Fix this by correctly checking whether the current page belongs to the
>> range being instantiated and if so correctly adjust the range parameters
>> passed for cleaning up. If it doesn't, then just clean the whole OE
>> range directly.
> 
> And the solution also looks good to me, and much more robust, without
> any (possibly wrong) assumption.
> 
> Thanks,
> Qu
> 
>>
>> Signed-off-by: Nikolay Borisov <nborisov@xxxxxxxx>
>> Fixes: 524272607e88 ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
>> ---
>>
>> V2: 
>>  * Fix compilation failure due to missing parentheses
>>  * Fixed the "Fixes" tag. 
>>
>>  fs/btrfs/inode.c | 29 ++++++++++++++++++++---------
>>  1 file changed, 20 insertions(+), 9 deletions(-)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index e1f00d8c24ce..5906564ae2e9 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -109,17 +109,17 @@ static void __endio_write_update_ordered(struct inode *inode,
>>   * extent_clear_unlock_delalloc() to clear both the bits EXTENT_DO_ACCOUNTING
>>   * and EXTENT_DELALLOC simultaneously, because that causes the reserved metadata
>>   * to be released, which we want to happen only when finishing the ordered
>> - * extent (btrfs_finish_ordered_io()). Also note that the caller of the
>> - * fill_delalloc() callback already does proper cleanup for the first page of
>> - * the range, that is, it invokes the callback writepage_end_io_hook() for the
>> - * range of the first page.
>> + * extent (btrfs_finish_ordered_io()).
>>   */
>>  static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
>> -						 const u64 offset,
>> -						 const u64 bytes)
>> +						 struct page *locked_page,
>> +						 u64 offset, u64 bytes)
>>  {
>>  	unsigned long index = offset >> PAGE_SHIFT;
>>  	unsigned long end_index = (offset + bytes - 1) >> PAGE_SHIFT;
>> +	u64 page_start = page_offset(locked_page);
>> +	u64 page_end = page_start + PAGE_SIZE - 1;
>> +
>>  	struct page *page;
>>  
>>  	while (index <= end_index) {
>> @@ -130,8 +130,18 @@ static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
>>  		ClearPagePrivate2(page);
>>  		put_page(page);
>>  	}
>> -	return __endio_write_update_ordered(inode, offset + PAGE_SIZE,
>> -					    bytes - PAGE_SIZE, false);
>> +
>> +	/*
>> +	 * In case this page belongs to the delalloc range being instantiated
>> +	 * then skip it, since the first page of a range is going to be
>> +	 * properly cleaned up by the caller of run_delalloc_range
>> +	 */
>> +	if (page_start >= offset && page_end <= (offset + bytes - 1)) {
>> +		offset += PAGE_SIZE;
>> +		bytes -= PAGE_SIZE;
>> +	}
>> +
>> +	return __endio_write_update_ordered(inode, offset, bytes, false);
>>  }
>>  
>>  static int btrfs_dirty_inode(struct inode *inode);
>> @@ -1606,7 +1616,8 @@ static int run_delalloc_range(void *private_data, struct page *locked_page,
>>  					   write_flags);
>>  	}
>>  	if (ret)
>> -		btrfs_cleanup_ordered_extents(inode, start, end - start + 1);
>> +		btrfs_cleanup_ordered_extents(inode, locked_page, start,
>> +					      end - start + 1);
>>  	return ret;
>>  }
>>  
>>
> 



[Index of Archives]     [Linux Filesystem Development]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux