On 11 Sep 2019, at 15:55, fdmanana@xxxxxxxxxx wrote:
> From: Filipe Manana <fdmanana@xxxxxxxx>
>
> So fix this by not overwriting the return value (ret) with the result
> from flush_write_bio(). We also need to clear the
> EXTENT_BUFFER_WRITEBACK
> bit in case flush_write_bio() returns an error, otherwise it will hang
> any future attempts to writeback the extent buffer.
>
> This is a regression introduced in the 5.2 kernel.
>
> Fixes: 2e3c25136adfb ("btrfs: extent_io: add proper error handling to
> lock_extent_buffer_for_io()")
> Fixes: f4340622e0226 ("btrfs: extent_io: Move the BUG_ON() in
> flush_write_bio() one level up")
> Reported-by: Zdenek Sojka <zsojka@xxxxxxxxx>
> Link:
> https://lore.kernel.org/linux-btrfs/GpO.2yos.3WGDOLpx6t%7D.1TUDYM@xxxxxxxxx/T/#u
> Reported-by: Stefan Priebe - Profihost AG <s.priebe@xxxxxxxxxxxx>
> Link:
> https://lore.kernel.org/linux-btrfs/5c4688ac-10a7-fb07-70e8-c5d31a3fbb38@xxxxxxxxxxxx/T/#t
> Reported-by: Drazen Kacar <drazen.kacar@xxxxxxxxxxx>
> Link:
> https://lore.kernel.org/linux-btrfs/DB8PR03MB562876ECE2319B3E579590F799C80@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204377
> Signed-off-by: Filipe Manana <fdmanana@xxxxxxxx>
> ---
> fs/btrfs/extent_io.c | 23 ++++++++++++++---------
> 1 file changed, 14 insertions(+), 9 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 1ff438fd5bc2..1311ba0fc031 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3628,6 +3628,13 @@ void wait_on_extent_buffer_writeback(struct
> extent_buffer *eb)
> TASK_UNINTERRUPTIBLE);
> }
>
> +static void end_extent_buffer_writeback(struct extent_buffer *eb)
> +{
> + clear_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
> + smp_mb__after_atomic();
> + wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
> +}
> +
> /*
> * Lock eb pages and flush the bio if we can't the locks
> *
> @@ -3699,8 +3706,11 @@ static noinline_for_stack int
> lock_extent_buffer_for_io(struct extent_buffer *eb
>
> if (!trylock_page(p)) {
> if (!flush) {
> - ret = flush_write_bio(epd);
> - if (ret < 0) {
> + int err;
> +
> + err = flush_write_bio(epd);
> + if (err < 0) {
> + ret = err;
> failed_page_nr = i;
> goto err_unlock;
> }
Dennis (cc'd) has been trying a similar fix against this in production,
but sending it was interrupted by plumbing conferences. I think he
found that it needs to undo this as well:
percpu_counter_add_batch(&fs_info->dirty_metadata_bytes,
-eb->len,
fs_info->dirty_metadata_batch);
With the IO errors, we should end up aborting the FS. This function
also flips the the extent buffer written and dirty flags, and his patch
resets them as well. Given that we're aborting anyway, it's not
critical, but it's probably a good idea to fix things up in the goto
err_unlock just to make future bugs less likely.
-chris