Re: multi-fentry proposal. Was: [PATCH bpf-next v2 02/18] x86,bpf: add... | netdev

Re: multi-fentry proposal. Was: [PATCH bpf-next v2 02/18] x86,bpf: add bpf_global_caller for global trampoline

From: Menglong Dong <hidden>
Date: 2025-07-17 02:39:01
Also in: bpf, lkml

On Thursday, July 17, 2025 10:13 AM Alexei Starovoitov [off-list ref] write:

On Wed, Jul 16, 2025 at 6:51 PM Menglong Dong [off-list ref] wrote:

quoted

On Thursday, July 17, 2025 8:59 AM Alexei Starovoitov [off-list ref] write:

quoted

On Wed, Jul 16, 2025 at 6:06 AM Menglong Dong [off-list ref] wrote:

quoted

On Wednesday, July 16, 2025 12:35 AM Alexei Starovoitov [off-list ref] write:

quoted

On Tue, Jul 15, 2025 at 1:37 AM Menglong Dong [off-list ref] wrote:

quoted


On 7/15/25 10:25, Alexei Starovoitov wrote:

[......]

quoted

According to my benchmark, it has ~5% overhead to save/restore
*5* variants when compared with *0* variant. The save/restore of regs
is fast, but it still need 12 insn, which can produce ~6% overhead.

I think it's an ok trade off, because with one global trampoline
we do not need to call rhashtable lookup before entering bpf prog.
bpf prog will do it on demand if/when it needs to access arguments.
This will compensate for a bit of lost performance due to extra save/restore.

I don't understand here :/

The rhashtable lookup is done at the beginning of the global trampoline,
which is called before we enter bpf prog. The bpf progs is stored in the
kfunc_md, and we need get them from the hash table.

Ahh. Right.

Looking at the existing bpf trampoline... It has complicated logic
to handle livepatching and tailcalls. Your global trampoline
doesn't, and once that is added it's starting to feel that it will
look just as complex as the current one.
So I think we better repurpose what we have.
Maybe we can rewrite the existing one in C too.

You are right, the tailcalls is not handled yet. But for the livepatching,
it is already handled, as we always get the origin ip from the stack
and call it, just like how the bpf trampoline handle the livepatching.
So no addition handling is needed here.

quoted

How about the following approach.
I think we discussed something like this in the past
and Jiri tried to implement something like this.
Andrii reminded me recently about it.

Say, we need to attach prog A to 30k functions.
10k with 2 args, 10k with 3 args, and 10k with 7 args.
We can generate 3 _existing_ bpf trampolines for 2,3,7 args
with hard coded prog A in there (the cookies would need to be
fetched via binary search similar to kprobe-multi).
The arch_prepare_bpf_trampoline() supports BPF_TRAMP_F_ORIG_STACK.
So one 2-arg trampoline will work to invoke prog A in all 10k 2-arg functions.
We don't need to match types, but have to compare that btf_func_model-s
are the same.

Menglong, your global trampoline for 0,1,..6 args works only for x86,
because btf_func_model doesn't care about sizes of args,
but it's not the correct mental model to use.

The above "10k with 2 args" is a simplified example.
We will need an arch specific callback is_btf_func_model_equal()
that will compare func models in arch specific ways.
For x86-64 the number of args is all it needs.
For other archs it will compare sizes and flags too.
So 30k functions will be sorted into
10k with btf_func_model_1, 10k with btf_func_model_2 and so on.
And the corresponding number of equivalent trampolines will be generated.

Note there will be no actual BTF types. All args will be untyped and
untrusted unlike current fentry.
We can go further and sort 30k functions by comparing BTFs
instead of btf_func_model-s, but I suspect 30k funcs will be split
into several thousands of exact BTFs. At that point multi-fentry
benefits are diminishing and we might as well generate 30k unique
bpf trampolines for 30k functions and avoid all the complexity.
So I would sort by btf_func_model compared by arch specific comparator.

Now say prog B needs to be attached to another 30k functions.
If all 30k+30k functions are different then it's the same as
the previous step.
Say, prog A is attached to 10k funcs with btf_func_model_1.
If prog B wants to attach to the exact same func set then we
just regenerate bpf trampoline with hard coded progs A and B
and reattach.
If not then we need to split the set into up to 3 sets.
Say, prog B wants 5k funcs, but only 1k func are common:
(prog_A, 9k func with btf_func_model_1) -> bpf trampoline X
(prog_A, prog_B, 1k funcs with btf_func_model_1) -> bpf trampoline Y
(prog_B, 4k funcs with btf_func_model_1) -> bpf trampoline Z

And so on when prog C needs to be attached.
At detach time we can merge sets/trampolines,
but for now we can leave it all fragmented.
Unlike regular fentry progs the multi-fentry progs are not going to
be attached for long time. So we can reduce the detach complexity.

The nice part of the algorithm is that coexistence of fentry
and multi-fentry is easy.
If fentry is already attached to some function we just
attach multi-fentry prog to that bpf trampoline.
If multi-fentry was attached first and fentry needs to be attached,
we create a regular bpf trampoline and add both progs there.

This seems not easy, and it is exactly how I handle the
coexistence now:

  https://lore.kernel.org/bpf/20250528034712.138701-16-dongml2@chinatelecom.cn/ (local)
  https://lore.kernel.org/bpf/20250528034712.138701-17-dongml2@chinatelecom.cn/ (local)
  https://lore.kernel.org/bpf/20250528034712.138701-18-dongml2@chinatelecom.cn/ (local)

hmm. exactly? That's very different.
You're relying on kfunc_md for prog list.
The above proposal doesn't need kfunc_md in the critical path.
All progs are built into the trampolines.

quoted

The most difficult part is that we need a way to replace the the
multi-fentry with fentry for the function in the ftrace atomically. Of
course, we can remove the global trampoline first, and then attach
the bpf trampoline, which will make things much easier. But a
short suspend will happen for the progs in fentry-multi.

I don't follow.
In the above proposal fentry attach/detach is atomic.
Prepare a new trampoline, single call to ftrace to modify_fentry().

modify_fentry() is used to operate on the same ftrace_ops. For
example, we have the bpf trampoline A, and its corresponding
ftrace_ops is opsA. Now, the image of the trampolineA is updated,
we call modify_fentry() for opsA to update the direct call of it.

When we talk about the coexistence, it means the functionA is
attached with the global trampoline B, whose ftrace_ops is
opsB. We can't call modify_fentry(trampolineA, new_addr) here,
as the opsA is not register yet. And we can't call register_fentry
too, as the functionA is already in the direct_functions when we
register opsB.

So we need a way to do such transition.

quoted

The intersect and sorting by btf_func_model is not trivial,
but we can hold global trampoline_mutex, so no concerns of races.

Example:
bpf_link_A is a set of:
(prog_A, funcs X,Y with btf_func_model_1)
(prog_A, funcs N,M with btf_func_model_2)

To attach prog B via bpf_link_B that wants:
(prog_B, funcs Y,Z with btf_func_model_1)
(prog_B, funcs P,Q with btf_func_model_3)

walk all existing links, intersect and split, and update the links.
At the end:

bpf_link_A:
(prog_A, funcs X with btf_func_model_1)
(prog_A, prog_B funcs Y with btf_func_model_1)
(prog_A, funcs N,M with btf_func_model_2)

bpf_link_B:
(prog_A, prog_B funcs Y with btf_func_model_1)
(prog_B, funcs Z with btf_func_model_1)
(prog_B, funcs P,Q with btf_func_model_3)

When link is detached: walk its own tuples, remove the prog,
if nr_progs == 0 -> detach corresponding trampoline,
if nr_progs > 0 -> remove prog and regenerate trampoline.

If fentry prog C needs to be attached to N it might split bpf_link_A:
(prog_A, funcs X with btf_func_model_1)
(prog_A, prog_B funcs Y with btf_func_model_1)
(prog_A, funcs M with btf_func_model_2)
(prog_A, prog_C funcs N with _fentry_)

Last time we gave up on it because we discovered that
overlap support was too complicated, but I cannot recall now
what it was :)
Maybe all of the above repeating some old mistakes.

In my impression, this is exactly the solution of Jiri's, and this is
part of the discussion that I know:

  https://lore.kernel.org/bpf/ZfKY6E8xhSgzYL1I@krava/ (local)

Yes. It's similar, but somehow it feels simple enough now.
The algorithms for both detach and attach fit on one page,
and everything is uniform. There are no spaghetty of corner cases.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help