I'll look into this after I figure out the cgroups scheduler issue I'm currently working on. Thanks,
Josef
Sent from my iPhone
> On May 12, 2017, at 5:41 AM, Nikolay Borisov <nborisov@xxxxxxxx> wrote:
>
> Hello,
>
> I've been investigating a severe performance regression in generic/027 on
> latest master. Specifically, I traced it back to the rm portion of the test:
> rm now takes 4 times longer to delete the same number of files. After bisecting,
> I identified the culprit as 957780eb2788 ("Btrfs: introduce
> ticketed enospc infrastructure"). With this commit applied, rm
> takes 65 seconds to delete around 10k files, whereas before
> it took 15 seconds. I suspect this is only triggered when there
> isn't much available metadata space, since no one has complained
> so far.
>
> Further analysis showed that the extra time comes from the fact that a lot
> more transaction commits happen with the ticketed enospc code.
> Here are some stats obtained via systemtap instrumentation of flush_space
> (the flush states these counters refer to are sketched right after the numbers):
>
> Good:
> btrfs_trans_commit: 1088
> FLUSH_DELAYED_ITEMS_NR => 12281
> FLUSH_DELAYED_ITEMS => 6151
> FLUSH_DELALLOC => 6124
> FLUSH_DELALLOC_WAIT => 6124
> ALLOC_CHUNK => 6124
> COMMIT_TRANS => 1087
>
> Bad:
> btrfs_trans_commit: 4711
> FLUSH_DELAYED_ITEMS_NR => 9606
> FLUSH_DELAYED_ITEMS => 4709
> FLUSH_DELALLOC => 4709
> FLUSH_DELALLOC_WAIT => 4709
> ALLOC_CHUNK => 4709
> COMMIT_TRANS => 4709
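>
> For reference, these counters are the flush states which the space reclaim
> code walks through in increasing order of aggressiveness, COMMIT_TRANS being
> the last resort. A rough sketch of the enum as it is in ctree.h (quoted from
> memory, so the exact values might differ slightly):
>
>     enum btrfs_flush_state {
>             FLUSH_DELAYED_ITEMS_NR  = 1,    /* run a limited number of delayed items */
>             FLUSH_DELAYED_ITEMS     = 2,    /* run all delayed items */
>             FLUSH_DELALLOC          = 3,    /* start delalloc writeback */
>             FLUSH_DELALLOC_WAIT     = 4,    /* start delalloc and wait for it */
>             ALLOC_CHUNK             = 5,    /* allocate a new metadata chunk */
>             COMMIT_TRANS            = 6,    /* commit the transaction */
>     };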
>
> So we do 4 times more transaction commits. I've also traced the
> two branches in reserve_metadata_bytes (__reserve_metadata_bytes),
> and here are the numbers:
>
> Good:
> Failures: 9383
>
> Bad:
> Failures: 1579
>
> Those numbers count how many times we took the
>
>     if (ret && flush != BTRFS_RESERVE_NO_FLUSH) {
>
> branch in the good case, and how many times we executed the
> queue_work() call in the bad case, i.e. the number of times
> we have to flush_space. Here are the stack traces of
> btrfs_commit_transaction:
>
> Good:
> 0xffffffff81387ca0 : btrfs_commit_transaction+0x0/0xb40 [kernel]
> 0xffffffff813726eb : flush_space+0x9b/0x4e0 [kernel]
> 0xffffffff81372fe4 : reserve_metadata_bytes+0x2f4/0x710 [kernel]
> 0xffffffff81373425 : btrfs_block_rsv_add+0x25/0x60 [kernel]
> 0xffffffff81388bd3 : start_transaction+0x3f3/0x570 [kernel]
> 0xffffffff813891fa : btrfs_start_transaction_fallback_global_rsv+0x2a/0x200 [kernel]
> 0xffffffff81392785 : btrfs_unlink+0x35/0xb0 [kernel]
> 0xffffffff811c0e0a : vfs_unlink+0xba/0x1a0 [kernel]
> 0xffffffff811c5cdb : do_unlinkat+0x24b/0x2a0 [kernel]
> 0xffffffff811c646b : SyS_unlinkat+0x1b/0x30 [kernel]
> 0xffffffff816d0b40 : entry_SYSCALL_64_fastpath+0x23/0xc1 [kernel]
>
> Naturally.
>
> Bad:
>
> 0xffffffff81388b30 : btrfs_commit_transaction+0x0/0xb40 [kernel]
> 0xffffffff81372d73 : flush_space+0x93/0x510 [kernel]
> 0xffffffff813732f3 : btrfs_async_reclaim_metadata_space+0x103/0x460 [kernel]
> 0xffffffff8106d86f : process_one_work+0x1cf/0x660 [kernel]
> 0xffffffff8106dd4e : worker_thread+0x4e/0x490 [kernel]
> 0xffffffff810745ae : kthread+0xfe/0x120 [kernel]
> 0xffffffff816d1c0f : ret_from_fork+0x1f/0x40 [kernel]
> 0xffffffff810744b0 : kthread+0x0/0x120 [kernel]
>
> Also no surprises here, since flushing is always async now. Further breaking
> down which call site actually queued the btrfs_async_reclaim_metadata_space
> work reveals the following:
>
> traces[
> 0xffffffff81373774 : __reserve_metadata_bytes+0x124/0xa80 [kernel]
> 0xffffffff813740f4 : reserve_metadata_bytes+0x24/0x1c0 [kernel]
> 0xffffffff813742b5 : btrfs_block_rsv_add+0x25/0x60 [kernel]
> 0xffffffff81389a63 : start_transaction+0x3f3/0x570 [kernel]
> 0xffffffff8138a08a : btrfs_start_transaction_fallback_global_rsv+0x2a/0x200 [kernel]
> 0xffffffff81393615 : btrfs_unlink+0x35/0xb0 [kernel]
> 0xffffffff811c0e0a : vfs_unlink+0xba/0x1a0 [kernel]
> 0xffffffff811c5cdb : do_unlinkat+0x24b/0x2a0 [kernel]
> 0xffffffff811c646b : SyS_unlinkat+0x1b/0x30 [kernel]
> 0xffffffff816d19c0 : entry_SYSCALL_64_fastpath+0x23/0xc1 [kernel]
> ] = 3905
>
> __reserve_metadata_bytes+0x124/0xa80 is the queue_work from the
>
>     if (!ret && space_info->flags & BTRFS_BLOCK_GROUP_METADATA)
>
> branch, i.e. when we managed to do our reservation but would still
> like to do some preemptive space flushing.
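>
> Roughly, that branch looks like the following (paraphrased from my reading
> of extent-tree.c; the surrounding checks such as need_do_async_reclaim()
> and work_busy() may not match the exact code):
>
>     if (!ret && space_info->flags & BTRFS_BLOCK_GROUP_METADATA) {
>             used += orig_bytes;
>             /*
>              * The reservation itself succeeded, but if we look close
>              * to full, kick the async reclaim worker preemptively
>              * (unless it is already running).
>              */
>             if (need_do_async_reclaim(space_info, root, used) &&
>                 !work_busy(&root->fs_info->async_reclaim_work))
>                     queue_work(system_unbound_wq,
>                                &root->fs_info->async_reclaim_work);
>     }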
>
>
>
> traces[
> 0xffffffff81373eed : __reserve_metadata_bytes+0x89d/0xa80 [kernel]
> 0xffffffff813740f4 : reserve_metadata_bytes+0x24/0x1c0 [kernel]
> 0xffffffff813742b5 : btrfs_block_rsv_add+0x25/0x60 [kernel]
> 0xffffffff81389a63 : start_transaction+0x3f3/0x570 [kernel]
> 0xffffffff8138a08a : btrfs_start_transaction_fallback_global_rsv+0x2a/0x200 [kernel]
> 0xffffffff81393615 : btrfs_unlink+0x35/0xb0 [kernel]
> 0xffffffff811c0e0a : vfs_unlink+0xba/0x1a0 [kernel]
> 0xffffffff811c5cdb : do_unlinkat+0x24b/0x2a0 [kernel]
> 0xffffffff811c646b : SyS_unlinkat+0x1b/0x30 [kernel]
> 0xffffffff816d19c0 : entry_SYSCALL_64_fastpath+0x23/0xc1 [kernel]
> ] = 1579
>
> __reserve_metadata_bytes+0x89d/0xa80 is the queue_work from the
>
>     if (ret && flush != BTRFS_RESERVE_NO_FLUSH) {
>
> branch, i.e. when we failed to do our reservation.
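>
> Again paraphrasing (the exact ticket setup may differ a bit), the shape of
> that branch is something like this: instead of flushing synchronously in the
> caller's context as the old code did, we add a ticket to the space_info and
> let the async reclaim worker do the flushing for us:
>
>     if (ret && flush != BTRFS_RESERVE_NO_FLUSH) {
>             ticket.bytes = orig_bytes;
>             ticket.error = 0;
>             init_waitqueue_head(&ticket.wait);
>             if (flush == BTRFS_RESERVE_FLUSH_ALL) {
>                     list_add_tail(&ticket.list, &space_info->tickets);
>                     if (!space_info->flush) {
>                             space_info->flush = 1;
>                             /* hand the flushing off to the async worker */
>                             queue_work(system_unbound_wq,
>                                        &root->fs_info->async_reclaim_work);
>                     }
>             } else {
>                     list_add_tail(&ticket.list,
>                                   &space_info->priority_tickets);
>             }
>     }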
>
> So with the ticketed ENOSPC code we actually have fewer reservation failures,
> but ironically a lot more transaction commits. The end result is slower operations.
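>
> My reading of why more queued work translates into more commits: every time
> the async reclaim worker (btrfs_async_reclaim_metadata_space) runs, it walks
> the flush states until the pending tickets are satisfied, and COMMIT_TRANS is
> the final state. A pseudocode-level sketch of that loop (arguments and locking
> simplified, so don't take it literally):
>
>     flush_state = FLUSH_DELAYED_ITEMS_NR;
>     do {
>             /* try progressively more aggressive ways to reclaim space */
>             flush_space(root, space_info, to_reclaim, flush_state);
>             if (list_empty(&space_info->tickets))
>                     break;  /* all pending reservations satisfied */
>             flush_state++;
>     } while (flush_state <= COMMIT_TRANS);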
>
> I'm also sharing two files with a lot more debugging output, printing the
> space_info state whenever a reservation fails, for both the good case
> (before the ticketed enospc work) and the bad case (after it):
>
> good: http://sprunge.us/QBSc
> bad: http://sprunge.us/PaNI
>
> I'm running out of ideas about where the extra commits are coming from and
> how to debug this further. Any pointers would be much appreciated. I've had
> an off-list discussion with Josef about this, and one thing he suggested is
> that we might be hitting the ->failfast path in use_block_rsv; however, in
> both cases that code was never triggered. All measurements were made only
> during the rm portion of the test, not during the whole "Create files and
> delete them" phase.
>
>