Re: [RFC][PATCH] tracepoints: Move the work out of line from hotpath sections

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]


* Steven Rostedt (rostedt@xxxxxxxxxxx) wrote:
> With the recent talk of performance regressions, I decided to take a
> look a tracepoints. For those that are not familiar with them, they are
> static locations scattered throughout the kernel that let developers
> trace their code when needed, with very little overhead when not needed
> but configured in the kernel. Most distributions have tracepoints
> configured into their kernels.
> 
> They look like this:
> 
> 	trace_sched_switch(prev, next);
> 
> When tracing is configured, through macro magic, this simple line
> expands to something like this:
> 
> static inline void trace_sched_switch(struct task_struct *prev, struct task_struct *next)
> {
> 	if (static_key_false(&__tracepoint_sched_switch.key)) {
> 		struct tracepoint_func *it_func_ptr;
> 		void *it_func;
> 		void *__data;
> 
> 		rcu_read_lock_sched_notrace();
> 		it_func_ptr = rcu_dereference_sched((*__tracepoint_sched_switch)->funcs);
> 		if (it_func_ptr) {
> 			do {
> 				it_func = (it_func_ptr)->func;
> 				__data = (it_func_ptr)->data;
> 				((it_func))(__data, prev, next);
> 			} while ((++it_func_ptr)->func);
> 		}
> 		rcu_read_unlock_sched_notrace();
> 	}
> }
> 
> There's some more magic involved here. The static_key_false() uses
> jump_labels to make the contents of the if block "unlikely" as well as
> uses a nop in the code when not enabled.
> 
> The call site looks like this:
> 
>        48 8b b5 10 ff ff ff    mov    -0xf0(%rbp),%rsi
>        c7 46 28 01 00 00 00    movl   $0x1,0x28(%rsi)
>        66 66 66 66 90          nopl
>        48 8b b5 10 ff ff ff    mov    -0xf0(%rbp),%rsi
>        48 8b bd 18 ff ff ff    mov    -0xe8(%rbp),%rdi
> 
> When tracing is enabled, the nopl is converted into a jump to the
> tracing code that sits at the end of the function (__schedule in this
> case). That code looks like this:
> 
>        65 48 8b 04 25 30 ba    mov    %gs:0xba30,%rax
>        00 00
>        83 80 44 e0 ff ff 01    addl   $0x1,-0x1fbc(%rax)
>        4c 8b 2d b9 d3 80 00    mov    0x80d3b9(%rip),%r13        # ffffffff8257dff0 <__tracepoint_sched_switch+0x30>
>        4d 85 ed                test   %r13,%r13
>        74 54                   je     ffffffff81d70c90 <__schedule+0x510>
>        4d 8d 75 10             lea    0x10(%r13),%r14
>        49 8b 45 00             mov    0x0(%r13),%rax
>        4d 89 ec                mov    %r13,%r12
>        48 89 9d f0 fe ff ff    mov    %rbx,-0x110(%rbp)
>        4c 8b bd 18 ff ff ff    mov    -0xe8(%rbp),%r15
>        4c 89 f3                mov    %r14,%rbx
>        4d 89 ee                mov    %r13,%r14
>        4c 8b ad 10 ff ff ff    mov    -0xf0(%rbp),%r13
>        66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
>        49 8b 7c 24 08          mov    0x8(%r12),%rdi
>        49 83 c4 10             add    $0x10,%r12
>        4c 89 ea                mov    %r13,%rdx
>        4c 89 fe                mov    %r15,%rsi
>        ff d0                   callq  *%rax
>        4c 89 e0                mov    %r12,%rax
>        4c 29 f0                sub    %r14,%rax
>        48 8b 44 03 f0          mov    -0x10(%rbx,%rax,1),%rax
>        48 85 c0                test   %rax,%rax
>        75 df                   jne    ffffffff81d70c68 <__schedule+0x4e8>
>        48 8b 9d f0 fe ff ff    mov    -0x110(%rbp),%rbx
>        65 48 8b 04 25 30 ba    mov    %gs:0xba30,%rax
>        00 00
>        83 a8 44 e0 ff ff 01    subl   $0x1,-0x1fbc(%rax)
>        48 8b 80 38 e0 ff ff    mov    -0x1fc8(%rax),%rax
>        a8 08                   test   $0x8,%al
>        0f 84 e3 fd ff ff       je     ffffffff81d70a92 <__schedule+0x312>
>        e8 7c 05 00 00          callq  ffffffff81d71230 <preempt_schedule>
>        e9 d9 fd ff ff          jmpq   ffffffff81d70a92 <__schedule+0x312>
> 
> Now even though this code is moved to the end of the function, it still
> sits in the middle of lots of hot paths. All tracepoints act this way,
> and the kernel is growing rapidly with them. The latest count has
> something like 799 events existing currently in the kernel.
> 
> Now I figured, instead of doing all this work in the hot paths, why not
> call a function that is out of line from the hot paths. Something like
> this:
> 
> static inline void trace_sched_switch(struct task_struct *prev, struct task_struct *next)
> {
> 	if (static_key_false(&__tracepoint_sched_switch.key)) {
> 		extern void __tracepoint_hook_sched_switch(struct task_struct *prev, struct task_struct *next);
> 
> 		__tracepoint_hook_sched_switch(prev, next);
> 	}
> }
> 
> 
> And out of line (in the .text.unlikely section) we would have:
> 
> void __tracepoint_hook_sched_switch(struct task_struct *prev, struct task_struct *next)
> {
> 	struct tracepoint_func *it_func_ptr;
> 	void *it_func;
> 	void *__data;
> 
> 	rcu_read_lock_sched_notrace();
> 	it_func_ptr = rcu_dereference_sched((*__tracepoint_sched_switch)->funcs);
> 	if (it_func_ptr) {
> 		do {
> 			it_func = (it_func_ptr)->func;
> 			__data = (it_func_ptr)->data;
> 			((it_func))(__data, prev, next);
> 		} while ((++it_func_ptr)->func);
> 	}
> 	rcu_read_unlock_sched_notrace();
> }
> 
> 
> This makes what exists in the hot path files much smaller:
> 
>        48 8b b5 10 ff ff ff    mov    -0xf0(%rbp),%rsi
>        48 8b bd 18 ff ff ff    mov    -0xe8(%rbp),%rdi
>        e8 dd f8 fc ff          callq  ffffffff81d40450 <__tracepoint_hook_sched_switch>
>        e9 5a fe ff ff          jmpq   ffffffff81d709d2 <__schedule+0x312>
> 
> 
> Now, by doing this, it adds a bit more code. Although, the code that is
> added exists in the .text.unlikely section and the normal paths should
> actually shrink, but kernel bloat is added never-the-less.
> 
>    text	   data	    bss	    dec	    hex	filename
> 20016471	2594648	1945600	24556719	176b4af	vmlinux-before
> 20032535	2599960	1945600	24578095	177082f	vmlinux-after
> 
> It adds 20K to the code. Well, it's not really this much, because I made
> this kernel pull in all modules (including ext3, xfs and kvm) which all
> add quite a number of tracepoints. But it was better to do this to get a
> picture of how this affects the tracepoints. There were 707 tracepoints
> compiled in, making it roughly 30 bytes per tracepoint added. Which
> makes sense, since the code use to be embedded at the site, but now
> there's a call to another function to do the work.
> 
> I haven't ran any good tests to see how much of a performance
> improvement this does, but the patch is here and you can try it out and
> see for yourself.
> 
> Pros:
>  Less code in the hot paths.
>  Hopefully, performance improves
> 
> Cons:
>  Adds some bloat to the kernel (but in an out-of-the-way section)
>  Slows down actual tracing as it needs to call a function and will
>    have more cache misses during the trace.
> 
> This patch just got the code working. I need to update the comments that
> go with tracepoints and fix the tracepoint samples file. But I wanted to
> get peoples reactions before putting any more effort into this.
> 
> Should I go ahead with this or not?

I guess the very first thing to do would be to benchmark this patch
thoroughly to see if it brings significant performance improvements to
the "tracing built-in, not enabled" case. If it does bring a significant
improvement, then we can consider the overhead on tracing, and see if it
makes sense to add this extra indirection on the tracing hot path.

About your "Pro" point "Less code in the hot paths", it remains to be
seen, overall, how many bytes of the removed instructions actually sit
on hot cache lines and if there is any effect on the TLB, especially
given that this "maybe-not-so-hot" code is in an unlikely branch.

It might be better to improve gcc to move really cold branches out of
line (really, really far away), and use the compiler to do this, rather
than to use an extra indirection that adds bloat and complexity to the
kernel.

Thanks,

Mathieu

> 
> -- Steve
> 
> Index: linux-trace.git/include/linux/tracepoint.h
> ===================================================================
> --- linux-trace.git.orig/include/linux/tracepoint.h
> +++ linux-trace.git/include/linux/tracepoint.h
> @@ -114,25 +114,35 @@ static inline void tracepoint_synchroniz
>   * as "(void *, void)". The DECLARE_TRACE_NOARGS() will pass in just
>   * "void *data", where as the DECLARE_TRACE() will pass in "void *data, proto".
>   */
> -#define __DO_TRACE(tp, proto, args, cond, prercu, postrcu)		\
> -	do {								\
> +#define __DO_TRACE(name, proto, args, data_proto, data_args)		\
> +	void __attribute__((section(".text.unlikely")))			\
> +	__tracepoint_hook_##name(proto)					\
> +	{								\
>  		struct tracepoint_func *it_func_ptr;			\
> +		struct tracepoint *tp = &__tracepoint_##name;		\
>  		void *it_func;						\
>  		void *__data;						\
>  									\
> -		if (!(cond))						\
> -			return;						\
> -		prercu;							\
>  		rcu_read_lock_sched_notrace();				\
>  		it_func_ptr = rcu_dereference_sched((tp)->funcs);	\
>  		if (it_func_ptr) {					\
>  			do {						\
>  				it_func = (it_func_ptr)->func;		\
>  				__data = (it_func_ptr)->data;		\
> -				((void(*)(proto))(it_func))(args);	\
> +				((void(*)(data_proto))(it_func))(data_args); \
>  			} while ((++it_func_ptr)->func);		\
>  		}							\
>  		rcu_read_unlock_sched_notrace();			\
> +	}
> +
> +#define __DO_TRACE_HOOK(name, proto, args, cond, prercu, postrcu)	\
> +	do {								\
> +		extern void __tracepoint_hook_##name(proto);		\
> +									\
> +		if (!(cond))						\
> +			return;						\
> +		prercu;							\
> +		__tracepoint_hook_##name(args);				\
>  		postrcu;						\
>  	} while (0)
>  
> @@ -141,22 +151,22 @@ static inline void tracepoint_synchroniz
>   * not add unwanted padding between the beginning of the section and the
>   * structure. Force alignment to the same alignment as the section start.
>   */
> -#define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args) \
> +#define __DECLARE_TRACE(name, proto, args, cond, data_proto)		\
>  	extern struct tracepoint __tracepoint_##name;			\
>  	static inline void trace_##name(proto)				\
>  	{								\
>  		if (static_key_false(&__tracepoint_##name.key))		\
> -			__DO_TRACE(&__tracepoint_##name,		\
> -				TP_PROTO(data_proto),			\
> -				TP_ARGS(data_args),			\
> +			__DO_TRACE_HOOK(name,				\
> +				TP_PROTO(proto),			\
> +				TP_ARGS(args),				\
>  				TP_CONDITION(cond),,);			\
>  	}								\
>  	static inline void trace_##name##_rcuidle(proto)		\
>  	{								\
>  		if (static_key_false(&__tracepoint_##name.key))		\
> -			__DO_TRACE(&__tracepoint_##name,		\
> -				TP_PROTO(data_proto),			\
> -				TP_ARGS(data_args),			\
> +			__DO_TRACE_HOOK(name,				\
> +				TP_PROTO(proto),			\
> +				TP_ARGS(args),				\
>  				TP_CONDITION(cond),			\
>  				rcu_idle_exit(),			\
>  				rcu_idle_enter());			\
> @@ -183,18 +193,31 @@ static inline void tracepoint_synchroniz
>   * structures, so we create an array of pointers that will be used for iteration
>   * on the tracepoints.
>   */
> -#define DEFINE_TRACE_FN(name, reg, unreg)				 \
> -	static const char __tpstrtab_##name[]				 \
> -	__attribute__((section("__tracepoints_strings"))) = #name;	 \
> -	struct tracepoint __tracepoint_##name				 \
> -	__attribute__((section("__tracepoints"))) =			 \
> +#define __DEFINE_TRACE_FN(name, proto, args, data_proto, data_args, reg, unreg) \
> +	static const char __tpstrtab_##name[]				\
> +	__attribute__((section("__tracepoints_strings"))) = #name;	\
> +	struct tracepoint __tracepoint_##name				\
> +	__attribute__((section("__tracepoints"))) =			\
>  		{ __tpstrtab_##name, STATIC_KEY_INIT_FALSE, reg, unreg, NULL };\
> -	static struct tracepoint * const __tracepoint_ptr_##name __used	 \
> -	__attribute__((section("__tracepoints_ptrs"))) =		 \
> -		&__tracepoint_##name;
> +	static struct tracepoint * const __tracepoint_ptr_##name __used	\
> +	__attribute__((section("__tracepoints_ptrs"))) =		\
> +		&__tracepoint_##name;					\
> +	__DO_TRACE(name,						\
> +		   TP_PROTO(proto),					\
> +		   TP_ARGS(args),					\
> +		   TP_PROTO(data_proto),				\
> +		   TP_ARGS(data_args))
> +
> +#define DEFINE_TRACE_FN_NOARGS(name, proto, args)		\
> +	__DEFINE_TRACE_FN(name, void, , void *__data, __data, , )
> +
> +#define DEFINE_TRACE_FN(name, proto, args, reg, unreg)		\
> +	__DEFINE_TRACE_FN(name, PARAMS(proto), PARAMS(args),	\
> +			  PARAMS(void *__data, proto),		\
> +			  PARAMS(__data, args), reg, unreg)
>  
> -#define DEFINE_TRACE(name)						\
> -	DEFINE_TRACE_FN(name, NULL, NULL);
> +#define DEFINE_TRACE(name, proto, args)					\
> +	DEFINE_TRACE_FN(name, PARAMS(proto), PARAMS(args), NULL, NULL);
>  
>  #define EXPORT_TRACEPOINT_SYMBOL_GPL(name)				\
>  	EXPORT_SYMBOL_GPL(__tracepoint_##name)
> @@ -202,7 +225,7 @@ static inline void tracepoint_synchroniz
>  	EXPORT_SYMBOL(__tracepoint_##name)
>  
>  #else /* !CONFIG_TRACEPOINTS */
> -#define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args) \
> +#define __DECLARE_TRACE(name, proto, args, cond, data_proto)		\
>  	static inline void trace_##name(proto)				\
>  	{ }								\
>  	static inline void trace_##name##_rcuidle(proto)		\
> @@ -223,8 +246,8 @@ static inline void tracepoint_synchroniz
>  	{								\
>  	}
>  
> -#define DEFINE_TRACE_FN(name, reg, unreg)
> -#define DEFINE_TRACE(name)
> +#define DEFINE_TRACE_FN(name, proto, args, reg, unreg)
> +#define DEFINE_TRACE(name, proto, args)
>  #define EXPORT_TRACEPOINT_SYMBOL_GPL(name)
>  #define EXPORT_TRACEPOINT_SYMBOL(name)
>  
> @@ -245,17 +268,15 @@ static inline void tracepoint_synchroniz
>   * "void *__data, proto" as the callback prototype.
>   */
>  #define DECLARE_TRACE_NOARGS(name)					\
> -		__DECLARE_TRACE(name, void, , 1, void *__data, __data)
> +		__DECLARE_TRACE(name, void, , 1, void *__data)
>  
>  #define DECLARE_TRACE(name, proto, args)				\
>  		__DECLARE_TRACE(name, PARAMS(proto), PARAMS(args), 1,	\
> -				PARAMS(void *__data, proto),		\
> -				PARAMS(__data, args))
> +				PARAMS(void *__data, proto))
>  
>  #define DECLARE_TRACE_CONDITION(name, proto, args, cond)		\
>  	__DECLARE_TRACE(name, PARAMS(proto), PARAMS(args), PARAMS(cond), \
> -			PARAMS(void *__data, proto),			\
> -			PARAMS(__data, args))
> +			PARAMS(void *__data, proto))
>  
>  #define TRACE_EVENT_FLAGS(event, flag)
>  
> Index: linux-trace.git/include/trace/define_trace.h
> ===================================================================
> --- linux-trace.git.orig/include/trace/define_trace.h
> +++ linux-trace.git/include/trace/define_trace.h
> @@ -24,7 +24,7 @@
>  
>  #undef TRACE_EVENT
>  #define TRACE_EVENT(name, proto, args, tstruct, assign, print)	\
> -	DEFINE_TRACE(name)
> +	DEFINE_TRACE(name, PARAMS(proto), PARAMS(args))
>  
>  #undef TRACE_EVENT_CONDITION
>  #define TRACE_EVENT_CONDITION(name, proto, args, cond, tstruct, assign, print) \
> @@ -38,15 +38,15 @@
>  #undef TRACE_EVENT_FN
>  #define TRACE_EVENT_FN(name, proto, args, tstruct,		\
>  		assign, print, reg, unreg)			\
> -	DEFINE_TRACE_FN(name, reg, unreg)
> +	DEFINE_TRACE_FN(name, PARAMS(proto), PARAMS(args), reg, unreg)
>  
>  #undef DEFINE_EVENT
>  #define DEFINE_EVENT(template, name, proto, args) \
> -	DEFINE_TRACE(name)
> +	DEFINE_TRACE(name, PARAMS(proto), PARAMS(args))
>  
>  #undef DEFINE_EVENT_PRINT
>  #define DEFINE_EVENT_PRINT(template, name, proto, args, print)	\
> -	DEFINE_TRACE(name)
> +	DEFINE_TRACE(name, PARAMS(proto), PARAMS(args))
>  
>  #undef DEFINE_EVENT_CONDITION
>  #define DEFINE_EVENT_CONDITION(template, name, proto, args, cond) \
> @@ -54,7 +54,7 @@
>  
>  #undef DECLARE_TRACE
>  #define DECLARE_TRACE(name, proto, args)	\
> -	DEFINE_TRACE(name)
> +	DEFINE_TRACE(name, PARAMS(proto), PARAMS(args))
>  
>  #undef TRACE_INCLUDE
>  #undef __TRACE_INCLUDE
> Index: linux-trace.git/samples/tracepoints/tracepoint-sample.c
> ===================================================================
> --- linux-trace.git.orig/samples/tracepoints/tracepoint-sample.c
> +++ linux-trace.git/samples/tracepoints/tracepoint-sample.c
> @@ -1,4 +1,5 @@
>  /* tracepoint-sample.c
> + * FIX ME
>   *
>   * Executes a tracepoint when /proc/tracepoint-sample is opened.
>   *
> 
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[Other Archives]     [Linux Kernel Newbies]     [Linux Driver Development]     [Linux Kbuild]     [Fedora Kernel]     [Linux Kernel Testers]     [Linux SH]     [Linux Omap]     [Linux Tape]     [Linux Input]     [Linux Kernel Janitors]     [Linux Kernel Packagers]     [Linux Doc]     [Linux Man Pages]     [Linux API]     [Linux Memory Management]     [Linux Modules]     [Linux Standards]     [Kernel Announce]     [Netdev]     [Git]     [Linux PCI]     Linux CAN Development     [Linux I2C]     [Linux RDMA]     [Linux NUMA]     [Netfilter]     [Netfilter Devel]     [SELinux]     [Bugtraq]     [FIO]     [Linux Perf Users]     [Linux Serial]     [Linux PPP]     [Linux ISDN]     [Linux Next]     [Kernel Stable Commits]     [Linux Tip Commits]     [Kernel MM Commits]     [Linux Security Module]     [AutoFS]     [Filesystem Development]     [Ext3 Filesystem]     [Linux bcache]     [Ext4 Filesystem]     [Linux BTRFS]     [Linux CEPH Filesystem]     [Linux XFS]     [XFS]     [Linux NFS]     [Linux CIFS]     [Ecryptfs]     [Linux NILFS]     [Linux Cachefs]     [Reiser FS]     [Initramfs]     [Linux FB Devel]     [Linux OpenGL]     [DRI Devel]     [Fastboot]     [Linux RT Users]     [Linux RT Stable]     [eCos]     [Corosync]     [Linux Clusters]     [LVS Devel]     [Hot Plug]     [Linux Virtualization]     [KVM]     [KVM PPC]     [KVM ia64]     [Linux Containers]     [Linux Hexagon]     [Linux Cgroups]     [Util Linux]     [Wireless]     [Linux Bluetooth]     [Bluez Devel]     [Ethernet Bridging]     [Embedded Linux]     [Barebox]     [Linux MMC]     [Linux IIO]     [Sparse]     [Smatch]     [Linux Arch]     [x86 Platform Driver]     [Linux ACPI]     [Linux IBM ACPI]     [LM Sensors]     [CPU Freq]     [Linux Power Management]     [Linmodems]     [Linux DCCP]     [Linux SCTP]     [ALSA Devel]     [Linux USB]     [Linux PA RISC]     [Linux Samsung SOC]     [MIPS Linux]     [IBM S/390 Linux]     [ARM Linux]     [ARM Kernel]     [ARM MSM]     [Tegra Devel]     [Sparc Linux]     [Linux Security]     [Linux Sound]     [Linux Media]     [Video 4 Linux]     [Linux IRDA Users]     [Linux for the blind]     [Linux RAID]     [Linux ATA RAID]     [Device Mapper]     [Linux SCSI]     [SCSI Target Devel]     [Linux SCSI Target Infrastructure]     [Linux IDE]     [Linux SMP]     [Linux AXP]     [Linux Alpha]     [Linux M68K]     [Linux ia64]     [Linux 8086]     [Linux x86_64]     [Linux Config]     [Linux Apps]     [Linux MSDOS]     [Linux X.25]     [Linux Crypto]     [DM Crypt]     [Linux Trace Users]     [Linux Btrace]     [Linux Watchdog]     [Utrace Devel]     [Linux C Programming]     [Linux Assembly]     [Dash]     [DWARVES]     [Hail Devel]     [Linux Kernel Debugger]     [Linux gcc]     [Gcc Help]     [X.Org]     [Wine]

Add to Google Powered by Linux

[Older Kernel Discussion]     [Yosemite National Park Forum]     [Large Format Photos]     [Gimp]     [Yosemite Photos]     [Stuff]