Thread: 39 messages, 5 authors, 2025-02-06

Re: [PATCH v22 00/20] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph

From: Jiri Olsa <hidden>
Date: 2025-01-14 15:12:13
Also in: bpf, linux-arch, lkml

On Fri, Jan 10, 2025 at 04:04:37PM -0800, Andrii Nakryiko wrote:
On Thu, Jan 2, 2025 at 5:21 AM Jiri Olsa [off-list ref] wrote:
On Thu, Dec 26, 2024 at 02:11:16PM +0900, Masami Hiramatsu (Google) wrote:
Hi,

Here is the 22nd version of the series to re-implement the fprobe on
function-graph tracer. The previous version is:

https://lore.kernel.org/all/173379652547.973433.2311391879173461183.stgit@devnote2/

This version is rebased on v6.13-rc4 with fixes on [3/20] for x86-32 and
[5/20] for build error.

hi,
I ran the bench and I'm seeing native_sched_clock being used
again in the kretprobe_multi bench:

     5.85%  bench            [kernel.kallsyms]                                        [k] native_sched_clock
            |
            ---native_sched_clock
               sched_clock
               |
                --5.83%--trace_clock_local
                          ftrace_return_to_handler
                          return_to_handler
                          syscall
                          bpf_prog_test_run_opts
completely unrelated, Jiri, but we should stop using
bpf_prog_test_run_opts() for benchmarking. It goes through FD
refcounting, which adds a tiny unnecessary overhead, but more importantly
it causes cache line bouncing between multiple CPUs (when doing
multi-threaded benchmarks), which skews and limits results.
so you mean switching to directly attaching to/hitting kernel functions,
or perhaps better, having a kernel module for that?

jirka
                          trigger_producer_batch
                          start_thread
                          __GI___clone3

I recall we tried to fix that before with change [1], but that was
later replaced by the changes in [2]

When I remove the trace_clock_local call in __ftrace_return_to_handler
then the kretprobe-multi gets much faster (see last block below), so it
seems worth making it optional
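
The idea of making the timestamp optional can be modeled in a few lines
of user-space Python. This is not kernel code and not the actual fgraph
API: `make_return_handler`, `need_timestamp`, and the callback signature
are hypothetical names, chosen only to show a return path that reads the
clock when some consumer asked for it and skips the read otherwise.

```python
import time


def make_return_handler(callbacks, need_timestamp):
    """Build a function-return handler (hypothetical model, not kernel code).

    need_timestamp is an assumed opt-in flag: when it is False the handler
    never touches the clock, which is the saving discussed in the thread.
    """
    def handler(func_name):
        # Read the clock only if a consumer requested a return timestamp.
        rettime = time.perf_counter_ns() if need_timestamp else None
        for cb in callbacks:
            cb(func_name, rettime)

    return handler


if __name__ == "__main__":
    seen = []
    record = lambda name, ts: seen.append((name, ts))

    make_return_handler([record], need_timestamp=False)("vfs_read")
    make_return_handler([record], need_timestamp=True)("vfs_read")
    print(seen)
```

In this model a consumer that only needs the return event, as a
kretprobe-multi style callback might, would be built with
`need_timestamp=False` and never pay for the clock read.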

there's some decrease in the kprobe_multi benchmark compared to the base
numbers; I'm not sure yet why, but other than that it seems ok

base:
        kprobe         :   12.873 ± 0.011M/s
        kprobe-multi   :   13.088 ± 0.052M/s
        kretprobe      :    6.339 ± 0.003M/s
        kretprobe-multi:    7.240 ± 0.002M/s

fprobe_on_fgraph:
        kprobe         :   12.816 ± 0.002M/s
        kprobe-multi   :   12.126 ± 0.004M/s
        kretprobe      :    6.305 ± 0.018M/s
        kretprobe-multi:    7.740 ± 0.003M/s

removed native_sched_clock call:
        kprobe         :   12.850 ± 0.006M/s
        kprobe-multi   :   12.115 ± 0.006M/s
        kretprobe      :    6.270 ± 0.017M/s
        kretprobe-multi:    9.190 ± 0.005M/s
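
For reference, the relative changes between the three runs above can be
computed with a short script. This is only a sketch: the dictionary names
and the `pct_change` helper are made up for illustration, and the numbers
are the mean throughputs copied from the blocks above (the ± terms are
ignored).

```python
# Mean throughput in M/s, copied from the three benchmark blocks above.
base = {
    "kprobe": 12.873,
    "kprobe-multi": 13.088,
    "kretprobe": 6.339,
    "kretprobe-multi": 7.240,
}
fprobe_on_fgraph = {
    "kprobe": 12.816,
    "kprobe-multi": 12.126,
    "kretprobe": 6.305,
    "kretprobe-multi": 7.740,
}
no_sched_clock = {
    "kprobe": 12.850,
    "kprobe-multi": 12.115,
    "kretprobe": 6.270,
    "kretprobe-multi": 9.190,
}


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


for name in base:
    on_fgraph = pct_change(base[name], fprobe_on_fgraph[name])
    no_clock = pct_change(base[name], no_sched_clock[name])
    print(f"{name:16s} fprobe_on_fgraph {on_fgraph:+6.2f}%   "
          f"no sched_clock {no_clock:+6.2f}%")
```

On these numbers kretprobe-multi ends up roughly 27% above base once the
clock call is removed, while kprobe-multi stays about 7% below base in
both patched runs.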


happy new year ;-) thanks,

jirka


[1] https://lore.kernel.org/bpf/172615389864.133222.14452329708227900626.stgit@devnote2/
[2] https://lore.kernel.org/all/20240914214805.779822616@goodmis.org/
[...]