Junseok's bug report[1] about stack tracer on arm64 has exposed additional issues to me. I tried to fix them for the last few weeks, and submitted several RFC/patches. But the problems seem to be more complicated than I first imagined. So, before taking further steps, I'd like you to have time to read the following document and give me your thoughts on it. Some of my ideas, particularly a function prologue analyzer, are difficult to implement, and may have no guarantee to handle all the cases correctly. Some may have impact on code size (and extra instructions at runtime). I think that we should find out the balance of accuracy/conveniences of this tool (stack tracer) and added complexities/open issues. -Takahiro AKASHI Table of Contents ================= I. basic flow of stack tracer II. issues on the naive implementation III.my approach and open issues I. basic flow of stack tracer ------------------------------ Every C function invokes a ftrace hook routine, _mcount or ftrace_caller, in its function prologue. step-1.If the current stack pointer indicates that we have a new record of maximum stack usage, then do the followings. step-2.call arch-specific save_stack_trace() to collect a list of functions in current call chains (we name it a stack dump list here.) step-3.For each function in a stack dump list, compare it against every 4bytes of integer on the stack data. step-4.If matched, we recognize the matched location in the stack as a current stack pointer for this function (for example, idxB for func_B:pcB) step-5.calculate a "stack index" for this function as <top address of stack> - <stack pointer> and record the value into dump_stack_index[]. LOW | | rspA | | | ... <----------- static/dynamic local variables in func_A | | rbpA +--------+ func_A (pcA, rbpA, rspA) | rbpB | idxB +--------+ | pcB | | ... <----------- extra function args to func_A rspB + - - - -+ | | | ... <----------- static/dynamic local variables in func_B | | rbpB +--------+ func_B (pcB, rbpB, rspB) | rbpC | idxC +--------+ | pcC | | ... <----------- extra function args to func_B rspC + - - - -+ | | | ... <----------- static/dynamic local variables in func_C | | rbpC +--------+ func_C (pcC, rbpC, rspC) | | | ... | top + - - - -+ HIGH * In the figure above, rbp(frame pointer) and rsp(stack pointer) point to respective addresses at the time after a function prologue has been done but before a callee function is called. Later on, "cat /sys/kernel/debug/tracing/stack_trace" shows the latest list of stack dump with sizes for each function. - stack size for func_B, for example, is calculated as <stack size> = (top - idxB) - (top - idxC) II. issues on the naive implementation -------------------------------------- By "naive implementation," I mean how the mainline kernel without any modification behaves with ftrace-based stack tracer. (Currently, there is no arch-specific code on arm64 regarding stack tracer.) 1) slurping stack Assume that a given function A was invoked, and it was invoked again in another context, then it called another function B which allocated a large length of local variable on the stack, but it has not modified the variable yet. When stack tracer, in step-3, examines the stack looking for B, then A, we may have a chance to accidentally find a staled, not current, stack frame for A because the old frame might reside on the memory for the variable which has not been overwritten. a. The stack_trace output may have staled entries. 2) differences between x86 and arm64 On x86, "call" instruction automatically pushes a return address on the top of the stack and decrements a stack pointer. Then child function allocates its local variables on the stack. On arm64, a child function decreases reserve memory for local variables, pushes a return address (LR) and old frame pointer then updates a frame pointer to stack pointer. So step-3/4 seems OK on x86, but on arm64, an estimated stack pointer, idxB in the picture below, is not appropriate for our purpose because it contains child function's "local variables." We should instead use spB, if possible, for better result. a. Stack size for a function in stack_trace output is inaccurate, or rather wrong. (It seems as if <Size> field is one-line offset against <Location>.) LOW | ... | fpA +--------+ func_A (pcA, fpA, spA) | fpB | idxB + - - - -+ | pcB | | ... <----------- static local variables in func_A | ... | and extra function args to func_A spB + - - - -+ | ... <----------- dynamically allocated variables in func_B fpB +--------+ func_B (pcB, fpB, spB) | fpC | idxC + - - - -+ | pcC | | ... <----------- static local variables in func_B | ... | and extra function args to func_B spC + - - - -+ | ... | fpC +--------+ func_C (pcC, fpC, spC) HIGH | | 3) Interrupted frame We miss one or two functions when an interrupt occurs. Assume that func_D calls func_C, func_C calls func_B, func_B is interrupted by a device, el1_irq calls func_A, and so on. a.Since el1_irq (and other exception vectors) doesn't update a frame pointer (x29), the current wind_stackframe() tracks only func_A -> el1_irq -> func_C -> func_D and always misses func_B in a stack dump list. LOW | | fpA,spA +--------+ func_A (pcA, fpA, spA) | fpB | | pcI | | ... | spI +--------+ el1_irq (pcI, fpB, spI) | | | ... | | | | fpB | | spB | | pcB | | ... | fpB,spB +--------+ func_B (pcB, fpB, spB) | fpC | | pcC | | ... | fpC,spC +--------+ func_C (pcC, fpC, spC) | fpD | | pcD | | ... | fpD,spD +--------+ func_D (pcD, fpD, spD) HIGH | | b.Even worse, if an interrupt occurs in a function prolog of func_B, a stack frame can be uncompleted and a frame pointer may still points to one of func_C (fpC). In this case, we also miss func_C in a stack dump list. c.As a result, stack sizes for functions around el1_irq are wrong. 4) functions under function_graph tracer To catch an event of function return, function_graph tracer rewrites LR register saved in a stack frame to a hook function, return_to_handler. An original value is saved in current->ret_stack[] on entry and restored later on exit by ftrace_return_to_handler() called in return_to_handler. a. With function_graph tracer on, unwind_stackframe() will return lots of bogus entries, "return_to_handler-0x4." Apart from stack tracer, fixing this issue might make more sense, say, for panic cases because panic_stack_dump() also uses unwind_frame(). 5) leaf function gcc (on arm64) will, by default, optimize a leaf function by removing a stack frame. This causes several issues: a.we always miss a case of maximum stack size whenever a leaf function is at the top of call chains because a stack tracer code will never be executed against it. b.When an interrupt occurs during a leaf function (func_B), we miss its parent function (func_C) in a stack dump list because func_B has no stack frame and elr_el1 still points to func_C's frame. (This is a variant case as described in "interrupted frame" section.) LOW | | fpA,spA +--------+ func_A (pcA, fpA, spA) | fpB | | pcI | | ... | spI +--------+ el1_irq (pcI, fpC, spI) (pt_regs)| | ... | | fpC | | spB | | pcB | spB +--------+ func_B (pcB, fpC, spB) | ... | fpC,spC +--------+ func_C (pcC, fpC, spC) HIGH | | 6) KVM (T.B.D.) III.my approach and open issues ------------------------------- 1) slurping stack, and 2) differences between x86 and arm64 Instead of slurping stack data like at step-3/4, we extend save_stack_trace() /wind_stackframe() to collect a list of stack pointers for all call chains. Since arm64 ABI does not save a stack pointer in a stack frame, we try to calculate that value from callee's stack frame. For example, in the following case, a value of stack point when func_B calls func_A is calculated as: spB = fpA + (XX -- YY) where XX and YY are determined by a "function prologue analyzer" described in "interrupted frame" section. LOW | ... | fpA +--------+ func_A (pcA, fpA, spA) | fpB | | pcB | | ... <----------- static local variables in func_A | ... | and extra function args to func_A spB + - - - -+ | ... <----------- dynamically allocated variables in func_B fpB +--------+ func_B (pcB, fpB, spB) | fpC | | pcC | | ... <----------- static local variables in func_B | ... | and extra function args to func_B spC + - - - -+ | ... | fpC +--------+ func_C (pcC, fpC, spC) HIGH | | open issue-0: This fix also depends on function prologue analyzer being available. 3) interrupted frame First of all, we should update a frame pointer even in exception vector code so that unwind_frame() can dereference stack frames. Since exception handlers place data of 'struct pt_regs' instead of 'struct stackframe' in *normal* functions, we will take special care of this case. open issue-1: should we use the same form of stack frame here? This way, we will see an additional entry, func_B in the figure above, in a stack dump list but we still have a chance to miss func_C. My idea here is to analyze a function prologue and determine where an interrupt is actually taken, then to re-build a stack frame if necessary. open issue-2: it seems to be very difficult to implement such a parser for all the possible patterns of function prologue. This is partly because gcc doesn't use a fixed template to generate a function prologue code, and partly because "instruction scheduling" may mess up a function prologue and a body. According to comments from some toolchain guy, using libunwind with compiler-generated "unwind tables" information would be most promising for implementing an function prologue analyzer. It seems, however, that any previous attempts to merge libunwind into the kernel has been abandoned[2]. So, in the followings, I try to describe an outline of my own prototype (simple) analyzer and how it is utilized in order to find out func_C. My analyzer recognizes two patterns of function prologue, (A) and (B), which I think cover most cases, and identifies a location of interrupt, 1, 2, 3 and 0 (a normal case). function prologues ------------------ (A) sub sp, sp, #XX stp x29, x30, [sp, #YY] add x29, sp, #YY ... (B) stp x29, x30, [sp, #-XX] mov x29, sp ... Due to gcc's instruction scheduling, we may see other instructions between those assembler lines of function prologue. location of interrupt --------------------- (A) (B) loc-1 sub sp, sp, #XX stp, x29, x30, [sp, #-XX] loc-2 stp x29, x30, [sp, #YY] loc-3 add x29, sp, #YY mov x29, sp loc-0 <function body> <function body> Here, loc-1: SP not updated, FP/LR not saved, FP not updated loc-2: SP updated, FP/LR not saved, FP not updated loc-3: SP updated, FP/LR saved, FP not updated loc-0: SP updated, FP/LR saved, FP updated Re-build a stack frame for func_B and func_C -------------------------------------------- The following figures illustrate how we can re-build a stack frame for func_B and func_C based on available data on the stack. (loc-0) LOW | | fpA,spA +--------+ func_A (pcA, fpA, spA) | fpI | | pcI | | ... | fpI,spI +--------+ el1_irq (pcI, fpI, spI) | | | ... | | | | fpB | | spB | | pcB | | ... | fpB,spB +--------+ func_B (pcB, fpB, spB) | fpC | | pcC | | ... | fpC,spC +--------+ func_C (pcC, fpC, spC) | fpD | | pcD | | ... | fpD,spD +--------+ func_D (pcD, fpD, spD) HIGH | | => calculate frame B: sp: *(spI + S_SP) fp: *(spI + S_FP) pc: *(spI + S_PC) (loc-3) LOW | | fpA,spA +--------+ func_A (pcA, fpA, spA) | fpI | | pcI | | ... | fpI,spI +--------+ el1_irq (pcI, fpI, spI) | | | ... | | | | fpC | | spB | | pcB | | ... | spB +--------+ func_B (pcB, fpC, spB) | fpC | | pcC | | ... | fpC,spC +--------+ func_C (pcC, fpC, spC) | fpD | | pcD | | ... | fpD,spD +--------+ func_D (pcD, fpD, spD) HIGH | | => calculate frame B: sp: *(spI + F_SP) fp: *(spI + F_FP) pc: *(spI + F_PC) frame C: sp: spB + XX fp: fpB' pc: *(spB + YY + 0x8) (loc-2) LOW | | fpA,spA +--------+ func_A (pcA, fpA, spA) | fpI | | pcI | | ... | fpI,spI +--------+ el1_irq (pcI, fpI, spI) | | | ... | | | | fpC | | spB | | pcB | | ... | spB +--------+ func_B (pcB, fpC, spB) | x | | x | | ... | fpC,spC +--------+ func_C (pcC, fpC, spC) | fpD | | pcD | | ... | fpD,spD +--------+ func_D (pcD, fpD, spD) HIGH | | => calculate frame B: sp: *(spI + F_SP) fp: *(spI + F_FP) pc: *(spI + F_PC) frame C: sp: spB + XX fp: fpB' pc: "branch dest at pcD" (loc-1) LOW | | fpA,spA +--------+ func_A (pcA, fpA, spA) | fpI | | pcI | | ... | fpI,spI +--------+ el1_irq (pcI, fpI, spI) | | | ... | | | | fpC | | spC | | pcB | | ... | func_B (pcB, fpC, spC) fpC,spC +--------+ func_C (pcC, fpC, spC) | fpD | | pcD | | ... | fpD,spD +--------+ func_D (pcD, fpD, spD) HIGH | | => calculate frame B: sp: *(spI + F_SP) fp: *(spI + F_FP) pc: *(spI + F_PC) frame C: sp: spB' fp: fpB' pc: "branch dest at pcD" In the case of loc-1 or loc-2, open issue-3: PC derived from parent's bl instruction (pcC) is an entry address of the function, not the exact place when an interrupt is taken. open issue-4: If parent's bl instruction is "blr", we have no way to identify a branch destination address. open issue-5: Each tack pointer is determined from child to parent, so PC for func_C cannot be fixed within wind_stackframe(). 4) functions under function_graph tracer Unwind_frame() tries to replace "frame->pc" with an original value by looking it up in current->ret_stack[]. Unfortunately, saving a value to this array, incrementing an array index, current->curr_ret_index, and updating memory on the stack are not done atomically because ftrace does not disable interrupts. So if an interrupt occurs and succeeding functions are also traced, recovering a stack dump list (stack_trace->entries) from ret_stack[] may result in a wrong list being created. open issue-6: wrong replacement Assume that func_C calls func_B, func_B is interrupted, and el1_irq calls func_A. Depending on when an interrupt is taken during func_B, wind_stackframe() may return one of following stack dump lists: (please note that ftrace related functions are *not* traced.) (A) interrupt during a function prologue ... return_to_handler(for A) el1_irq ... ftrace_push_return_trace prepare_ftrace_return ftrace_caller func_B return_to_handler(for C) we have pushed func_B in ret_stack[], but LR in a stack frame has not yet been rewritten to return_to_handler. since ret_stack[] has func_A, func_B and func_C, we replace return_to_handler(for A) to func_A return_to_handler(for C) to func_B and so on. (B) interrupt during a return hook ... return_to_handler(for A) el1_irq ... ftrace_pop_return_trace ftrace_return_to_handler return_to_handler(for B) return_to_handler(for C) we may already have popped func_B from ret_stack[], and we replace return_to_handler(for A) to func_A return_to_handler(for B) to func_C and so on. In fact, there are more variant cases as pushing/popping against ret_stack[] is not an atomic operation and so data in this array and its index, current->ret_stack_index, may get inconsistent. 5) leaf function Adding arch-specific gcc option, -mno-omit-leaf-frame-pointer, to KBUILD_CFLAGS will assure that all the C functions have its own stack frame (after its function prologue). Then, issue 1-b will fall into issue-2. open issue-7: This fix will increase the code size of the kernel and add extra instructions at runtime. --- [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/354126.html [2] https://lkml.org/lkml/2012/2/10/129 _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@xxxxxxxxxxxxxxxxxxx http://lists.infradead.org/mailman/listinfo/linux-arm-kernel