Re: [PATCH bpf-next v2 02/18] x86,bpf: add bpf_global_caller for global trampoline
From: Menglong Dong <hidden>
Date: 2025-07-15 08:38:11
On 7/15/25 10:25, Alexei Starovoitov wrote:
On Thu, Jul 3, 2025 at 5:17 AM Menglong Dong [off-list ref] wrote:
+static __always_inline void
+do_origin_call(unsigned long *args, unsigned long *ip, int nr_args)
+{
+	/* Following code will be optimized by the compiler, as nr_args
+	 * is a const, and there will be no condition here.
+	 */
+	if (nr_args == 0) {
+		asm volatile(
+			RESTORE_ORIGIN_0 CALL_NOSPEC "\n"
+			"movq %%rax, %0\n"
+			: "=m"(args[nr_args]), ASM_CALL_CONSTRAINT
+			: [args]"r"(args), [thunk_target]"r"(*ip)
+			:
+		);
+	} else if (nr_args == 1) {
+		asm volatile(
+			RESTORE_ORIGIN_1 CALL_NOSPEC "\n"
+			"movq %%rax, %0\n"
+			: "=m"(args[nr_args]), ASM_CALL_CONSTRAINT
+			: [args]"r"(args), [thunk_target]"r"(*ip)
+			: "rdi"
+		);
+	} else if (nr_args == 2) {
+		asm volatile(
+			RESTORE_ORIGIN_2 CALL_NOSPEC "\n"
+			"movq %%rax, %0\n"
+			: "=m"(args[nr_args]), ASM_CALL_CONSTRAINT
+			: [args]"r"(args), [thunk_target]"r"(*ip)
+			: "rdi", "rsi"
+		);
+	} else if (nr_args == 3) {
+		asm volatile(
+			RESTORE_ORIGIN_3 CALL_NOSPEC "\n"
+			"movq %%rax, %0\n"
+			: "=m"(args[nr_args]), ASM_CALL_CONSTRAINT
+			: [args]"r"(args), [thunk_target]"r"(*ip)
+			: "rdi", "rsi", "rdx"
+		);
+	} else if (nr_args == 4) {
+		asm volatile(
+			RESTORE_ORIGIN_4 CALL_NOSPEC "\n"
+			"movq %%rax, %0\n"
+			: "=m"(args[nr_args]), ASM_CALL_CONSTRAINT
+			: [args]"r"(args), [thunk_target]"r"(*ip)
+			: "rdi", "rsi", "rdx", "rcx"
+		);
+	} else if (nr_args == 5) {
+		asm volatile(
+			RESTORE_ORIGIN_5 CALL_NOSPEC "\n"
+			"movq %%rax, %0\n"
+			: "=m"(args[nr_args]), ASM_CALL_CONSTRAINT
+			: [args]"r"(args), [thunk_target]"r"(*ip)
+			: "rdi", "rsi", "rdx", "rcx", "r8"
+		);
+	} else if (nr_args == 6) {
+		asm volatile(
+			RESTORE_ORIGIN_6 CALL_NOSPEC "\n"
+			"movq %%rax, %0\n"
+			: "=m"(args[nr_args]), ASM_CALL_CONSTRAINT
+			: [args]"r"(args), [thunk_target]"r"(*ip)
+			: "rdi", "rsi", "rdx", "rcx", "r8", "r9"
+		);
+	}
+}

What is the performance difference between 0-6 variants? I would think save/restore of regs shouldn't be that expensive.
bpf trampoline saves only what's necessary because it can do this micro optimization, but for this one, I think, doing _one_ global trampoline that covers all cases will simplify the code a lot, but please benchmark the difference to understand the trade-off.
According to my benchmark, the save/restore in the *5* variant has ~5% overhead compared with the *0* variant. The save/restore of regs is fast, but it still needs 12 insns, which can produce ~6% overhead.

I think the performance is more important and we should keep this logic. Should we? If you think do_origin_call() is not simple enough, we can restore all 6 regs from the stack directly for the origin call, which won't introduce too much overhead, and keep the save/restore logic. What do you think?
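To illustrate why the per-nr_args variants cost nothing at run time, here is a rough userspace sketch. It is not the patch code: call_origin(), the caller_N() wrappers, the fnN_t typedefs and the three sample targets are all hypothetical names, and plain C calls stand in for the RESTORE_ORIGIN_N + CALL_NOSPEC asm. The point is only that nr_args is a compile-time constant in each always-inline expansion, so the compiler keeps a single branch per wrapper.

```c
#include <assert.h>

typedef unsigned long ul;
typedef void (*any_fn_t)(void);	/* generic function pointer, cast before use */
typedef ul (*fn0_t)(void);
typedef ul (*fn1_t)(ul);
typedef ul (*fn2_t)(ul, ul);

/* Hypothetical analogue of do_origin_call(): args[0..nr_args-1] hold the
 * saved arguments, and args[nr_args] receives the return value.
 */
static inline __attribute__((always_inline)) void
call_origin(ul *args, any_fn_t ip, int nr_args)
{
	assert(nr_args >= 0 && nr_args <= 2);

	/* nr_args is constant in every caller below, so after inlining
	 * only one of these branches survives in each wrapper.
	 */
	if (nr_args == 0)
		args[0] = ((fn0_t)ip)();
	else if (nr_args == 1)
		args[1] = ((fn1_t)ip)(args[0]);
	else
		args[2] = ((fn2_t)ip)(args[0], args[1]);
}

/* One wrapper per argument count, similar in spirit to bpf_global_caller_N. */
static void caller_0(ul *args, any_fn_t ip) { call_origin(args, ip, 0); }
static void caller_1(ul *args, any_fn_t ip) { call_origin(args, ip, 1); }
static void caller_2(ul *args, any_fn_t ip) { call_origin(args, ip, 2); }

/* Sample "origin" functions for trying the wrappers out. */
static ul forty_two(void) { return 42; }
static ul twice(ul a) { return 2 * a; }
static ul sum2(ul a, ul b) { return a + b; }
```

A single "one trampoline for them all" variant would instead always restore all six argument registers, trading those extra loads for not having to know nr_args at all.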
The major simplification will be due to skipping nr_args. There won't be a need to do btf model and count the args. Just do one trampoline for them all. Also funcs with 7+ arguments need to be thought through from the start.
In the current version, the attachment will be rejected if any functions have 7+ arguments.
I think it's an OK trade-off if we allow the global trampoline to be safe to attach to a function with 7+ args (and it will not mess with the stack), but the bpf prog can only access up to 6 args. The kfuncs to access arg 7 might be more complex and slower. It's an OK trade-off.
It's OK for fentry-multi, but we can't allow fexit-multi and modify_return-multi to be attached to a function with 7+ args, as we need to do the origin call, and for now we can't recover the arguments on the stack for the origin call. So for fentry-multi we can allow attaching to functions with 7+ args as long as all the accessed arguments are in regs.

And I think we need one more patch to do the "all accessed arguments are in regs" checking, so maybe we can put it in the next series? The current series is a little complex :/

Anyway, I'll have a try to see if we can add this part in this series :)
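Just to make the rule above concrete, a minimal sketch of such a check could look like the following. Everything here is hypothetical (the enum, multi_attach_allowed(), and the max_accessed_arg parameter are made up for illustration and are not in the series), and it assumes each argument occupies exactly one register, so only the first six are in regs on x86-64.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_REG_ARGS 6	/* x86-64: the first six integer args are in regs */

/* Hypothetical attach types, named after the ones discussed in this thread. */
enum multi_attach_type {
	FENTRY_MULTI,
	FEXIT_MULTI,
	MODIFY_RETURN_MULTI,
};

/* max_accessed_arg is the highest 0-based argument index the prog reads,
 * or -1 if it reads no arguments at all.
 */
static bool multi_attach_allowed(enum multi_attach_type type, int nr_args,
				 int max_accessed_arg)
{
	assert(nr_args >= 0 && max_accessed_arg < nr_args);

	/* Up to six args: everything is in regs, any type can attach. */
	if (nr_args <= MAX_REG_ARGS)
		return true;

	/* fexit/modify_return need the origin call, which can't rebuild
	 * the stack arguments for now.
	 */
	if (type != FENTRY_MULTI)
		return false;

	/* fentry: fine as long as the prog only touches register args. */
	return max_accessed_arg < MAX_REG_ARGS;
}
```

With this, a fentry-multi prog reading only arg index 5 of an 8-arg function would be accepted, while the same prog reading index 6 would be rejected.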
+
+static __always_inline notrace void
+run_tramp_prog(struct kfunc_md_tramp_prog *tramp_prog,
+	       struct bpf_tramp_run_ctx *run_ctx, unsigned long *args)
+{
+	struct bpf_prog *prog;
+	u64 start_time;
+
+	while (tramp_prog) {
+		prog = tramp_prog->prog;
+		run_ctx->bpf_cookie = tramp_prog->cookie;
+		start_time = bpf_gtramp_enter(prog, run_ctx);
+
+		if (likely(start_time)) {
+			asm volatile(
+				CALL_NOSPEC "\n"
+				: : [thunk_target]"r"(prog->bpf_func), [args]"D"(args)
+			);

Why this cannot be "call *(prog->bpf_func)" ?
Do you mean "prog->bpf_func(args, NULL);"? In my previous testing, this caused bad performance, and I saw others do the indirect call this way. I just ran the benchmark again, and it seems the performance is not affected by this anymore. So I think I can replace it with "prog->bpf_func(args, NULL);" in the next version.
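For reference, the loop with a plain C indirect call would look roughly like the userspace mock below. This is only a sketch: struct mock_prog, struct mock_tramp_prog, run_progs() and bump_arg0() are hypothetical stand-ins, the enter/exit accounting from the patch is left out, and only the (ctx, insn) shape of the bpf_func call matches the kernel.

```c
#include <assert.h>
#include <stddef.h>

struct mock_insn;	/* stand-in for struct bpf_insn, never dereferenced */

/* Stand-in for struct bpf_prog: only the entry point matters here. */
struct mock_prog {
	unsigned int (*bpf_func)(const void *ctx, const struct mock_insn *insn);
};

/* Stand-in for struct kfunc_md_tramp_prog: a singly linked list of progs. */
struct mock_tramp_prog {
	struct mock_prog *prog;
	struct mock_tramp_prog *next;
};

/* Walk the list and call every prog with a plain C indirect call instead
 * of inline asm. Returns how many progs were run.
 */
static unsigned int run_progs(struct mock_tramp_prog *tramp_prog,
			      unsigned long *args)
{
	unsigned int nr_run = 0;

	while (tramp_prog) {
		assert(tramp_prog->prog && tramp_prog->prog->bpf_func);
		tramp_prog->prog->bpf_func(args, NULL);
		nr_run++;
		tramp_prog = tramp_prog->next;
	}
	return nr_run;
}

/* Sample "prog" that increments the first saved argument. */
static unsigned int bump_arg0(const void *ctx, const struct mock_insn *insn)
{
	(void)insn;
	((unsigned long *)ctx)[0]++;
	return 0;
}
```

The C form leaves register allocation and the choice of call sequence to the compiler, which is the main readability win over the asm volatile() version.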
+		}
+
+		bpf_gtramp_exit(prog, start_time, run_ctx);
+		tramp_prog = tramp_prog->next;
+	}
+}
+
+static __always_inline notrace int
+bpf_global_caller_run(unsigned long *args, unsigned long *ip, int nr_args)

Pls share top 10 from "perf report" while running the bench. I'm curious about what's hot. Last time I benchmarked fentry/fexit, migrate_disable/enable were among the hottest functions. I suspect it's the case here as well.
You are right, the migrate_disable/enable are the hottest functions in both bpf trampoline and global trampoline. Following is the perf top for fentry-multi:

  36.36%  bpf_prog_2dcccf652aac1793_bench_trigger_fentry_multi  [k] bpf_prog_2dcccf652aac1793_bench_trigger_fentry_multi
  20.54%  [kernel]                                              [k] migrate_enable
  19.35%  [kernel]                                              [k] bpf_global_caller_5_run
   6.52%  [kernel]                                              [k] bpf_global_caller_5
   3.58%  libc.so.6                                             [.] syscall
   2.88%  [kernel]                                              [k] entry_SYSCALL_64
   1.50%  [kernel]                                              [k] memchr_inv
   1.39%  [kernel]                                              [k] fput
   1.04%  [kernel]                                              [k] migrate_disable
   0.91%  [kernel]                                              [k] _copy_to_user

And I also did the testing for fentry:

  54.63%  bpf_prog_2dcccf652aac1793_bench_trigger_fentry  [k] bpf_prog_2dcccf652aac1793_bench_trigger_fentry
  10.43%  [kernel]                                        [k] migrate_enable
  10.07%  bpf_trampoline_6442517037                       [k] bpf_trampoline_6442517037
   8.06%  [kernel]                                        [k] __bpf_prog_exit_recur
   4.11%  libc.so.6                                       [.] syscall
   2.15%  [kernel]                                        [k] entry_SYSCALL_64
   1.48%  [kernel]                                        [k] memchr_inv
   1.32%  [kernel]                                        [k] fput
   1.16%  [kernel]                                        [k] _copy_to_user
   0.73%  [kernel]                                        [k] bpf_prog_test_run_raw_tp

The migrate_enable/disable are used to do the recursive checking, and I even wanted to perform recursive checks in the same way as ftrace to eliminate this overhead :/
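To show what I mean by the ftrace way: as far as I understand it, ftrace tracks recursion with bits in a per-task field rather than a per-CPU counter, so the check itself does not depend on staying on one CPU. A very rough userspace sketch of that idea follows. The names recursion_bits, recursion_trylock() and recursion_unlock() are hypothetical, a thread-local variable stands in for the per-task field, and the four context bits are an assumption for illustration, not the kernel layout.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CTX_BITS 4	/* assumed: one bit per execution context */

/* Stand-in for a per-task recursion field; one copy per thread. */
static _Thread_local unsigned long recursion_bits;

/* Try to enter the given context. Returns false if this task is already
 * inside it, i.e. the call would be a recursion.
 */
static bool recursion_trylock(int ctx_bit)
{
	unsigned long mask;

	assert(ctx_bit >= 0 && ctx_bit < NR_CTX_BITS);
	mask = 1UL << ctx_bit;

	if (recursion_bits & mask)
		return false;

	recursion_bits |= mask;
	return true;
}

/* Leave the context so that later entries are allowed again. */
static void recursion_unlock(int ctx_bit)
{
	assert(ctx_bit >= 0 && ctx_bit < NR_CTX_BITS);
	recursion_bits &= ~(1UL << ctx_bit);
}
```

Note this only detects recursion per task and context, not per prog, so it is a different granularity from what the bpf trampoline does today.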