On Mon, Mar 06, 2017 at 10:55:47AM +0800, Qu Wenruo wrote:
[BUG]
If run_delalloc_range() returns error and there is already some ordered
extents created, btrfs will be hanged with the following backtrace:
Call Trace:
__schedule+0x2d4/0xae0
schedule+0x3d/0x90
btrfs_start_ordered_extent+0x160/0x200 [btrfs]
? wake_atomic_t_function+0x60/0x60
btrfs_run_ordered_extent_work+0x25/0x40 [btrfs]
btrfs_scrubparity_helper+0x1c1/0x620 [btrfs]
btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
process_one_work+0x2af/0x720
? process_one_work+0x22b/0x720
worker_thread+0x4b/0x4f0
kthread+0x10f/0x150
? process_one_work+0x720/0x720
? kthread_create_on_node+0x40/0x40
ret_from_fork+0x2e/0x40
[CAUSE]
|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<>| |<---------- cleanup range --------->|
||
\_=> First page handled by end_extent_writepage() in
__extent_writepage()
The problem is caused by error handler of run_delalloc_range(), which
doesn't handle any created ordered extents, leaving them waiting on
btrfs_finish_ordered_io() to finish.
However after run_delalloc_range() returns error, __extent_writepage()
won't submit bio, so btrfs_writepage_end_io_hook() won't be triggered
except the first page, and btrfs_finish_ordered_io() won't be triggered
for created ordered extents either.
So OE 2~n will hang forever, and if OE 1 is larger than one page, it
will also hang.
[FIX]
Introduce btrfs_cleanup_ordered_extents() function to cleanup created
ordered extents and finish them manually.
The function is based on existing
btrfs_endio_direct_write_update_ordered() function, and modify it to
act just like btrfs_writepage_endio_hook() but handles specified range
other than one page.
After fix, delalloc error will be handled like:
|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<>|<-------- ----------->|<------ old error handler --------->|
|| ||
|| \_=> Cleaned up by cleanup_ordered_extents()
\_=> First page handled by end_extent_writepage() in
__extent_writepage()
Looks good, some nitpicks below.
Signed-off-by: Qu Wenruo <quwenruo@xxxxxxxxxxxxxx>
Signed-off-by: Filipe Manana <fdmanana@xxxxxxxx>
---
changelog:
v2:
Add BTRFS_ORDERED_SKIP_METADATA flag to avoid double reducing
outstanding extents, which is already done by
extent_clear_unlock_delalloc() with EXTENT_DO_ACCOUNT control bit
v3:
Skip first page to avoid underflow ordered->bytes_left.
Fix range passed in cow_file_range() which doesn't cover the whole
extent.
Expend extent_clear_unlock_delalloc() range to allow them to handle
metadata release.
v4:
Don't use extra bit to skip metadata freeing for ordered extent,
but only handle btrfs_reloc_clone_csums() error just before processing
next extent.
This makes error handle much easier for run_delalloc_nocow().
v5:
Variant gramma and comment fixes suggested by Filipe Manana
Enhanced commit message to focus on the generic error handler bug,
pointed out by Filipe Manana
Adding missing wq/func user in __endio_write_update_ordered(), pointed
out by Filipe Manana.
Enhanced commit message with ascii art to explain the bug more easily.
Fix a bug which can leads to corrupted extent allocation, exposed by
Filipe Manana.
v6:
Split the metadata underflow fix to another patch.
Use better comment and btrfs_cleanup_ordered_extent() implementation
from Filipe Manana.
---
fs/btrfs/inode.c | 62
++++++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 47 insertions(+), 15 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1d83d504f2e5..9b03eb9b27d0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -115,6 +115,30 @@ static struct extent_map *create_io_em(struct inode
*inode, u64 start, u64 len,
u64 ram_bytes, int compress_type,
int type);
+static void __endio_write_update_ordered(struct inode *inode,
+ u64 offset, u64 bytes,
+ bool uptodate);
+
+/*
+ * Cleanup all submitted ordered extents in specified range to handle
errors
+ * from the fill_dellaloc() callback.
+ *
+ * NOTE: caller must ensure that when an error happens, it can not call
+ * extent_clear_unlock_delalloc() to clear both the bits
EXTENT_DO_ACCOUNTING
+ * and EXTENT_DELALLOC simultaneously, because that causes the reserved
metadata
+ * to be released, which we want to happen only when finishing the
ordered
+ * extent (btrfs_finish_ordered_io()). Also note that the caller of the
+ * fill_dealloc() callback already does proper cleanup for the first
page of
fill_delalloc()
+ * the range, that is, it invokes the callback writepage_end_io_hook()
for the
+ * range of the first page.
+ */
+static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
+ u64 offset, u64 bytes)
+{
+ return __endio_write_update_ordered(inode, offset + PAGE_SIZE,
+ bytes - PAGE_SIZE, false);
+}
+
static int btrfs_dirty_inode(struct inode *inode);
#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
@@ -1536,6 +1560,8 @@ static int run_delalloc_range(struct inode *inode,
struct page *locked_page,
ret = cow_file_range_async(inode, locked_page, start,
end,
page_started, nr_written);
}
+ if (ret)
+ btrfs_cleanup_ordered_extents(inode, start, end - start +
1);
return ret;
}
@@ -8142,17 +8168,26 @@ static void btrfs_endio_direct_read(struct bio
*bio)
bio_put(bio);
}
-static void btrfs_endio_direct_write_update_ordered(struct inode *inode,
- const u64 offset,
- const u64 bytes,
- const int uptodate)
+static void __endio_write_update_ordered(struct inode *inode,
+ u64 offset, u64 bytes,
+ bool uptodate)
Not serious, but why const was killed?