Re: [PATCH bpf-next v2 02/18] x86,bpf: add bpf_global_caller for global trampoline
From: Alexei Starovoitov <hidden>
Date: 2025-07-15 02:25:36
Also in:
bpf, lkml
On Thu, Jul 3, 2025 at 5:17 AM Menglong Dong [off-list ref] wrote:
+static __always_inline void
+do_origin_call(unsigned long *args, unsigned long *ip, int nr_args)
+{
+ /* Following code will be optimized by the compiler, as nr_args
+ * is a const, and there will be no condition here.
+ */
+ if (nr_args == 0) {
+ asm volatile(
+ RESTORE_ORIGIN_0 CALL_NOSPEC "\n"
+ "movq %%rax, %0\n"
+ : "=m"(args[nr_args]), ASM_CALL_CONSTRAINT
+ : [args]"r"(args), [thunk_target]"r"(*ip)
+ :
+ );
+ } else if (nr_args == 1) {
+ asm volatile(
+ RESTORE_ORIGIN_1 CALL_NOSPEC "\n"
+ "movq %%rax, %0\n"
+ : "=m"(args[nr_args]), ASM_CALL_CONSTRAINT
+ : [args]"r"(args), [thunk_target]"r"(*ip)
+ : "rdi"
+ );
+ } else if (nr_args == 2) {
+ asm volatile(
+ RESTORE_ORIGIN_2 CALL_NOSPEC "\n"
+ "movq %%rax, %0\n"
+ : "=m"(args[nr_args]), ASM_CALL_CONSTRAINT
+ : [args]"r"(args), [thunk_target]"r"(*ip)
+ : "rdi", "rsi"
+ );
+ } else if (nr_args == 3) {
+ asm volatile(
+ RESTORE_ORIGIN_3 CALL_NOSPEC "\n"
+ "movq %%rax, %0\n"
+ : "=m"(args[nr_args]), ASM_CALL_CONSTRAINT
+ : [args]"r"(args), [thunk_target]"r"(*ip)
+ : "rdi", "rsi", "rdx"
+ );
+ } else if (nr_args == 4) {
+ asm volatile(
+ RESTORE_ORIGIN_4 CALL_NOSPEC "\n"
+ "movq %%rax, %0\n"
+ : "=m"(args[nr_args]), ASM_CALL_CONSTRAINT
+ : [args]"r"(args), [thunk_target]"r"(*ip)
+ : "rdi", "rsi", "rdx", "rcx"
+ );
+ } else if (nr_args == 5) {
+ asm volatile(
+ RESTORE_ORIGIN_5 CALL_NOSPEC "\n"
+ "movq %%rax, %0\n"
+ : "=m"(args[nr_args]), ASM_CALL_CONSTRAINT
+ : [args]"r"(args), [thunk_target]"r"(*ip)
+ : "rdi", "rsi", "rdx", "rcx", "r8"
+ );
+ } else if (nr_args == 6) {
+ asm volatile(
+ RESTORE_ORIGIN_6 CALL_NOSPEC "\n"
+ "movq %%rax, %0\n"
+ : "=m"(args[nr_args]), ASM_CALL_CONSTRAINT
+ : [args]"r"(args), [thunk_target]"r"(*ip)
+ : "rdi", "rsi", "rdx", "rcx", "r8", "r9"
+ );
+ }
+}

What is the performance difference between the 0-6 variants?
I would think save/restore of regs shouldn't be that expensive.
The bpf trampoline saves only what's necessary because it can do this
micro-optimization, but for this one, I think, doing _one_ global trampoline
that covers all cases will simplify the code a lot. Please benchmark
the difference to understand the trade-off.

The major simplification will come from dropping nr_args.
There won't be a need to build a btf model and count the args.
Just do one trampoline for them all.

Also, funcs with 7+ arguments need to be thought through from the start.
I think it's an ok trade-off if the global trampoline is safe to attach
to a function with 7+ args (and does not mess with the stack), but the
bpf prog can only access up to 6 args.
The kfuncs to access arg 7 might be more complex and slower.
That's an ok trade-off.
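
To make the single-trampoline idea concrete, here is a rough user-space
model in C, not kernel code. It assumes the x86-64 SysV convention, where
the first six integer arguments travel in rdi, rsi, rdx, rcx, r8 and r9 and
a callee that takes fewer simply ignores the rest. All names (struct
saved_regs, call_origin, demo_inc, demo_sum3) are made up for illustration.

```c
#define MAX_REG_ARGS 6

/* Hypothetical save area: six argument slots plus one for the return value. */
struct saved_regs {
	unsigned long arg[MAX_REG_ARGS];
	unsigned long ret;
};

/* Every demo target is declared with six parameters so the call is valid C. */
typedef unsigned long (*origin_fn)(unsigned long, unsigned long, unsigned long,
				   unsigned long, unsigned long, unsigned long);

/* One caller for all arities: no nr_args, no per-count branch. */
static void call_origin(struct saved_regs *regs, origin_fn fn)
{
	regs->ret = fn(regs->arg[0], regs->arg[1], regs->arg[2],
		       regs->arg[3], regs->arg[4], regs->arg[5]);
}

/* Uses one argument and ignores the other five. */
static unsigned long demo_inc(unsigned long a, unsigned long b, unsigned long c,
			      unsigned long d, unsigned long e, unsigned long f)
{
	(void)b; (void)c; (void)d; (void)e; (void)f;
	return a + 1;
}

/* Uses three arguments and ignores the other three. */
static unsigned long demo_sum3(unsigned long a, unsigned long b, unsigned long c,
			       unsigned long d, unsigned long e, unsigned long f)
{
	(void)d; (void)e; (void)f;
	return a + b + c;
}
```

The price of this shape is restoring registers the target may never read,
which is what the benchmark should quantify.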
+
+static __always_inline notrace void
+run_tramp_prog(struct kfunc_md_tramp_prog *tramp_prog,
+ struct bpf_tramp_run_ctx *run_ctx, unsigned long *args)
+{
+ struct bpf_prog *prog;
+ u64 start_time;
+
+ while (tramp_prog) {
+ prog = tramp_prog->prog;
+ run_ctx->bpf_cookie = tramp_prog->cookie;
+ start_time = bpf_gtramp_enter(prog, run_ctx);
+
+ if (likely(start_time)) {
+ asm volatile(
+ CALL_NOSPEC "\n"
+ : : [thunk_target]"r"(prog->bpf_func), [args]"D"(args)
+ );

Why can't this be "call *(prog->bpf_func)"?
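
For comparison, a user-space sketch of the plain C form: an indirect call
through the function pointer, with the call sequence left to the compiler.
struct demo_prog is a hypothetical stand-in for struct bpf_prog, and its
bpf_func takes only the context pointer here, which is a simplification.

```c
/* Hypothetical stand-in for struct bpf_prog; only the field used here. */
struct demo_prog {
	unsigned int (*bpf_func)(const void *ctx);
};

/* Demo program body: treats ctx as the saved args array and adds two slots. */
static unsigned int demo_func(const void *ctx)
{
	const unsigned long *args = ctx;

	return (unsigned int)(args[0] + args[1]);
}

/* Plain C indirect call; no hand-written asm and no explicit register pinning. */
static unsigned int run_one(const struct demo_prog *prog, unsigned long *args)
{
	return prog->bpf_func(args);
}
```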
+ }
+
+ bpf_gtramp_exit(prog, start_time, run_ctx);
+ tramp_prog = tramp_prog->next;
+ }
+}
+
+static __always_inline notrace int
+bpf_global_caller_run(unsigned long *args, unsigned long *ip, int nr_args)

Please share the top 10 from "perf report" while running the bench.
I'm curious about what's hot.
Last time I benchmarked fentry/fexit, migrate_disable/enable were among
the hottest functions. I suspect that's the case here as well.