Re: [PATCH] Btrfs: fix race leading to metadata space leak after task received signal

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Oct 25, 2019 at 10:53:41AM +0100, fdmanana@xxxxxxxxxx wrote:
> From: Filipe Manana <fdmanana@xxxxxxxx>
> 
> When a task that is allocating metadata needs to wait for the async
> reclaim job to process its ticket and gets a signal (because it was killed
> for example) before doing the wait, the task ends up erroring out but
> with space reserved for its ticket, which never gets released, resulting
> in a metadata space leak (more specifically a leak in the bytes_may_use
> counter of the metadata space_info object).
> 
> Here's the sequence of steps leading to the space leak:
> 
> 1) A task tries to create a file for example, so it ends up trying to
>    start a transaction at btrfs_create();
> 
> 2) The filesystem is currently in a state where there is not enough
>    metadata free space to satisfy the transaction's needs. So at
>    space-info.c:__reserve_metadata_bytes() we create a ticket and
>    add it to the list of tickets of the space info object. Also,
>    because the metadata async reclaim job is not running, we queue
>    a job ro run metadata reclaim;
> 
> 3) In the meanwhile the task receives a signal (like SIGTERM from
>    a kill command for example);
> 
> 4) After queing the async reclaim job, at __reserve_metadata_bytes(),
>    we unlock the metadata space info and call handle_reserve_ticket();
> 
> 5) That last function calls wait_reserve_ticket(), which acquires the
>    lock from the metadata space info. Then in the first iteration of
>    its while loop, it calls prepare_to_wait_event(), which returns
>    -ERESTARTSYS because the task has a pending signal. As a result,
>    we set the error field of the ticket to -EINTR and exit the while
>    loop without deleting the ticket from the list of tickets (in the
>    space info object). After exiting the loop we unlock the space info;
> 
> 6) The async reclaim job is able to release enough metadata, acquires
>    the metadata space info's lock and then reserves space for the ticket,
>    since the ticket is still in the list of (non-priority) tickets. The
>    space reservation happens at btrfs_try_granting_tickets(), called from
>    maybe_fail_all_tickets(). This increments the bytes_may_use counter
>    from the metadata space info object, sets the ticket's bytes field to
>    zero (meaning success, that space was reserved) and removes it from
>    the list of tickets;
> 
> 7) wait_reserve_ticket() returns, with the error field of the ticket
>    set to -EINTR. Then handle_reserve_ticket() just propagates that error
>    to the caller. Because an error was returned, the caller does not
>    release the reserved space, since the expectation is that any error
>    means no space was reserved.
> 
> Fix this by removing the ticket from the list, while holding the space
> info lock, at wait_reserve_ticket() when prepare_to_wait_event() returns
> an error.
> 
> Also add some comments and an assertion to guarantee we never end up with
> a ticket that has an error set and a bytes counter field set to zero, to
> more easily detect regressions in the future.
> 
> This issue could be triggered sporadically by some test cases from fstests
> such as generic/269 for example, which tries to fill a filesystem and then
> kills fsstress processes running in the background.
> 
> When this issue happens, we get a warning in syslog/dmesg when unmounting
> the filesystem, like the following:
> 
>  [Tue Oct  1 17:24:23 2019] ------------[ cut here ]------------
>  [Tue Oct  1 17:24:23 2019] WARNING: CPU: 0 PID: 13240 at fs/btrfs/block-group.c:3186 btrfs_free_block_groups+0x314/0x470 [btrfs]
>  (...)
>  [Tue Oct  1 17:24:23 2019] CPU: 0 PID: 13240 Comm: umount Tainted: G        W    L    5.3.0-rc8-btrfs-next-48+ #1
>  [Tue Oct  1 17:24:23 2019] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
>  [Tue Oct  1 17:24:23 2019] RIP: 0010:btrfs_free_block_groups+0x314/0x470 [btrfs]
>  (...)
>  [Tue Oct  1 17:24:24 2019] RSP: 0018:ffff9910c14cfdb8 EFLAGS: 00010286
>  [Tue Oct  1 17:24:24 2019] RAX: 0000000000000024 RBX: ffff89cd8a4d55f0 RCX: 0000000000000000
>  [Tue Oct  1 17:24:24 2019] RDX: 0000000000000000 RSI: ffff89cdf6a178a8 RDI: ffff89cdf6a178a8
>  [Tue Oct  1 17:24:24 2019] RBP: ffff9910c14cfde8 R08: 0000000000000000 R09: 0000000000000001
>  [Tue Oct  1 17:24:24 2019] R10: ffff89cd4d618040 R11: 0000000000000000 R12: ffff89cd8a4d5508
>  [Tue Oct  1 17:24:24 2019] R13: ffff89cde7c4a600 R14: dead000000000122 R15: dead000000000100
>  [Tue Oct  1 17:24:24 2019] FS:  00007f42754432c0(0000) GS:ffff89cdf6a00000(0000) knlGS:0000000000000000
>  [Tue Oct  1 17:24:24 2019] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  [Tue Oct  1 17:24:24 2019] CR2: 00007fd25a47f730 CR3: 000000021f8d6006 CR4: 00000000003606f0
>  [Tue Oct  1 17:24:24 2019] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>  [Tue Oct  1 17:24:24 2019] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>  [Tue Oct  1 17:24:24 2019] Call Trace:
>  [Tue Oct  1 17:24:24 2019]  close_ctree+0x1ad/0x390 [btrfs]
>  [Tue Oct  1 17:24:24 2019]  generic_shutdown_super+0x6c/0x110
>  [Tue Oct  1 17:24:24 2019]  kill_anon_super+0xe/0x30
>  [Tue Oct  1 17:24:24 2019]  btrfs_kill_super+0x12/0xa0 [btrfs]
>  [Tue Oct  1 17:24:24 2019]  deactivate_locked_super+0x3a/0x70
>  [Tue Oct  1 17:24:24 2019]  cleanup_mnt+0xb4/0x160
>  [Tue Oct  1 17:24:24 2019]  task_work_run+0x7e/0xc0
>  [Tue Oct  1 17:24:24 2019]  exit_to_usermode_loop+0xfa/0x100
>  [Tue Oct  1 17:24:24 2019]  do_syscall_64+0x1cb/0x220
>  [Tue Oct  1 17:24:24 2019]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>  [Tue Oct  1 17:24:24 2019] RIP: 0033:0x7f4274d2cb37
>  (...)
>  [Tue Oct  1 17:24:24 2019] RSP: 002b:00007ffcff701d38 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
>  [Tue Oct  1 17:24:24 2019] RAX: 0000000000000000 RBX: 0000557ebde2f060 RCX: 00007f4274d2cb37
>  [Tue Oct  1 17:24:24 2019] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000557ebde2f240
>  [Tue Oct  1 17:24:24 2019] RBP: 0000557ebde2f240 R08: 0000557ebde2f270 R09: 0000000000000015
>  [Tue Oct  1 17:24:24 2019] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f427522ee64
>  [Tue Oct  1 17:24:24 2019] R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffcff701fc0
>  [Tue Oct  1 17:24:24 2019] irq event stamp: 0
>  [Tue Oct  1 17:24:24 2019] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
>  [Tue Oct  1 17:24:24 2019] hardirqs last disabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0
>  [Tue Oct  1 17:24:24 2019] softirqs last  enabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0
>  [Tue Oct  1 17:24:24 2019] softirqs last disabled at (0): [<0000000000000000>] 0x0
>  [Tue Oct  1 17:24:24 2019] ---[ end trace bcf4b235461b26f6 ]---
>  [Tue Oct  1 17:24:24 2019] BTRFS info (device sdb): space_info 4 has 19116032 free, is full
>  [Tue Oct  1 17:24:24 2019] BTRFS info (device sdb): space_info total=33554432, used=14176256, pinned=0, reserved=0, may_use=196608, readonly=65536
>  [Tue Oct  1 17:24:24 2019] BTRFS info (device sdb): global_block_rsv: size 0 reserved 0
>  [Tue Oct  1 17:24:24 2019] BTRFS info (device sdb): trans_block_rsv: size 0 reserved 0
>  [Tue Oct  1 17:24:24 2019] BTRFS info (device sdb): chunk_block_rsv: size 0 reserved 0
>  [Tue Oct  1 17:24:24 2019] BTRFS info (device sdb): delayed_block_rsv: size 0 reserved 0
>  [Tue Oct  1 17:24:24 2019] BTRFS info (device sdb): delayed_refs_rsv: size 0 reserved 0
> 
> Fixes: 374bf9c5cd7d0b ("btrfs: unify error handling for ticket flushing")
> Signed-off-by: Filipe Manana <fdmanana@xxxxxxxx>

Reviewed-by: Josef Bacik <josef@xxxxxxxxxxxxxx>

Thanks,

Josef



[Index of Archives]     [Linux Filesystem Development]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux