Re: [PATCH v2 3/4] perf,x86: avoid missing caller address in stack traces... | linux-trace-kernel

Re: [PATCH v2 3/4] perf,x86: avoid missing caller address in stack traces captured in uprobe

From: Andrii Nakryiko <hidden>
Date: 2024-06-04 17:13:13
Also in: bpf, linux-perf-users

On Tue, Jun 4, 2024 at 7:06 AM Masami Hiramatsu [off-list ref] wrote:

On Tue, 21 May 2024 18:38:44 -0700
Andrii Nakryiko [off-list ref] wrote:

quoted

When tracing user functions with uprobe functionality, it's common to
install the probe (e.g., a BPF program) at the first instruction of the
function. This is often going to be a `push %rbp` instruction in the
function preamble, which means that within that function the frame
pointer hasn't been established yet. This leads to consistently missing
the actual caller of the traced function, because perf_callchain_user()
only records the current IP (capturing the traced function) and then
follows the frame pointer chain (which would be the caller's frame,
containing the address of the caller's caller).

I thought this problem might be solved by sframe.

Eventually, yes, when real-world applications switch to sframe and we
get proper support for it in the kernel. But right now there are tons
of applications relying on the kernel capturing stack traces based on
frame pointers, so it would be good to improve that as well.

quoted

So when we have a target_1 -> target_2 -> target_3 call chain and we are
tracing the entry to target_3, the captured stack trace will report a
target_1 -> target_3 call chain, which is wrong and confusing.

This patch proposes an x86-64-specific heuristic to detect a `push %rbp`
instruction being traced.

I like this kind of idea :) But I think this should be done in
user space, not in the kernel, because it is not always certain
that the user program uses stack frames.

Existing kernel code that captures user space stack traces already
assumes that the code was compiled with frame pointers (unconditionally),
because that's the best the kernel can do. So under that assumption this
heuristic is valid and not harmful, IMO.

User space can do nothing about this, because it is the kernel that
captures the stack trace (e.g., from a BPF program), and we will lose the
calling frame if we don't do it here.

quoted

If that's the case, with the assumption that the
application is compiled with frame pointers, this instruction would be
a strong indicator that this is the entry to the function. In that case,
the return address is still pointed to by %rsp, so we fetch it and add it
to the stack trace before proceeding to unwind the rest using frame
pointer-based logic.
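
The heuristic described above can be sketched as a small user-space
simulation. This is not the code from the patch: `struct sim_task`,
`capture_callchain()` and the word-indexed memory model are hypothetical
names made up for illustration. The only hard fact relied on is that
`push %rbp` has the one-byte encoding 0x55 on x86-64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PUSH_RBP_OPCODE 0x55	/* one-byte x86-64 encoding of `push %rbp` */

/*
 * Hypothetical model of the probed task (an assumption for this sketch):
 * memory is an array of 8-byte words, and sp/bp hold word indexes into it.
 * A frame record at index bp is laid out as:
 *   mem[bp]     = saved frame pointer of the caller (0 terminates the chain)
 *   mem[bp + 1] = return address into the caller
 */
struct sim_task {
	const uint64_t *mem;
	size_t mem_words;
	uint64_t ip;		/* address of the probed instruction */
	uint64_t sp;		/* word index of the top of the stack */
	uint64_t bp;		/* word index of the current frame record */
	uint8_t probed_insn;	/* first byte of the original instruction at ip */
};

/*
 * Record ip, optionally apply the function-entry heuristic, then follow
 * the frame pointer chain. Returns the number of entries written to out.
 */
static size_t capture_callchain(const struct sim_task *t, int fix_entry,
				uint64_t *out, size_t max)
{
	size_t n = 0;
	uint64_t bp = t->bp;

	assert(max > 0);
	out[n++] = t->ip;

	/*
	 * At `push %rbp` the frame pointer still belongs to the caller, and
	 * the return address into the caller sits at the top of the stack.
	 */
	if (fix_entry && t->probed_insn == PUSH_RBP_OPCODE &&
	    n < max && t->sp < t->mem_words)
		out[n++] = t->mem[t->sp];

	/* Plain frame-pointer unwinding, bounded by max and by memory size. */
	while (bp != 0 && bp < t->mem_words && bp + 1 < t->mem_words && n < max) {
		out[n++] = t->mem[bp + 1];
		bp = t->mem[bp];
	}
	return n;
}
```

With a simulated target_1 -> target_2 -> target_3 stack and a probe on
the first instruction of target_3, calling this with fix_entry == 0 skips
the return address into target_2, while fix_entry == 1 reports it right
after the probed IP.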

Why don't we do it in the userspace BPF program? If it is done
in user space, like perf-probe, I'm OK with it. But I have doubts about
doing this in the kernel, because that would not be flexible.

You mean for the BPF program to capture the entire stack trace by
itself, without asking the kernel for help? It's doable, but:

  a) it's inconvenient for all users to have to reimplement this
low-level "primitive" operation, which we already promise is provided
by the kernel (through the bpf_get_stack() API, and the kernel has an
internal perf_callchain API for this)
  b) it's faster for the kernel to do this, as kernel code disables page
faults once and unwinds the stack, while a BPF program would have to do
multiple bpf_probe_read_user() calls, each individually disabling page
faults.

But really, there is an already existing API, which in some cases
returns partially invalid stack traces (skipping function caller's
frame), so this is an attempt to fix this issue.

More than anything, without a user-space helper to find function
symbols, uprobe does not know where the function entry is. So I'm
curious why you don't do this in user space.

You are mixing stack *capture* (in which we get memory addresses
representing where a function call or currently running instruction
pointer is) with stack *symbolization* (where user space needs ELF
symbols and/or DWARF information to translate those addresses into
something human-readable).

This heuristic improves the former as performed by the kernel. Stack
symbolization is completely orthogonal to all of this.
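
To make the distinction concrete, here is a minimal sketch of what the
symbolization side looks like once raw addresses have been captured.
`struct sym` and `symbolize()` are hypothetical names for this example;
a real symbolizer would build the table from ELF symbols and/or DWARF
rather than from a hand-written array.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical symbol table entry: [start, start + size) maps to name. */
struct sym {
	uint64_t start;
	uint64_t size;
	const char *name;
};

/*
 * Translate one captured address into a symbol name, or NULL if no
 * entry covers it. Capture only produces the addr values; everything
 * in this function happens later, in user space.
 */
static const char *symbolize(const struct sym *tab, size_t n, uint64_t addr)
{
	for (size_t i = 0; i < n; i++) {
		if (addr >= tab[i].start && addr - tab[i].start < tab[i].size)
			return tab[i].name;
	}
	return NULL;
}
```

Nothing here depends on how the addresses were obtained, which is why
the entry heuristic only concerns the capture step.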

At least, this should be done in the user of uprobes, like trace_uprobe
or bpf.

It would be a really miserable user experience if users had to implement
their own stack trace capture for uprobes, but could use the built-in
bpf_get_stack() API for any other type of program.

Thank you,

quoted

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 arch/x86/events/core.c  | 20 ++++++++++++++++++++
 include/linux/uprobes.h |  2 ++
 kernel/events/uprobes.c |  2 ++
 3 files changed, 24 insertions(+)

[...]
