Re: [External] Re: [PATCH bpf-next v2 1/9] bpf: tracing: add support to... | netdev

Re: [External] Re: [PATCH bpf-next v2 1/9] bpf: tracing: add support to record and check the accessed args

From: Jiri Olsa <hidden>
Date: 2024-03-14 06:27:57
Also in: bpf, linux-arm-kernel, linux-kselftest, linux-riscv, linux-s390, linux-trace-kernel, lkml

On Wed, Mar 13, 2024 at 05:25:35PM -0700, Alexei Starovoitov wrote:

On Tue, Mar 12, 2024 at 6:53 PM 梦龙董 [off-list ref] wrote:

On Wed, Mar 13, 2024 at 12:42 AM Alexei Starovoitov
[off-list ref] wrote:

On Mon, Mar 11, 2024 at 7:42 PM 梦龙董 [off-list ref] wrote:

[......]

I see.
I thought you're sharing the trampoline across attachments.
(since bpf prog is the same).

That seems to be a good idea, which I hadn't thought of before.

But above approach cannot possibly work with a shared trampoline.
You need to create individual trampoline for all attachment
and point them to single bpf prog.

tbh I'm less excited about this feature now, since sharing
the prog across different attachments is nice, but it won't scale
to thousands of attachments.
I assumed that there will be a single trampoline with max(argno)
across attachments and attach/detach will scale to thousands.

With individual trampoline this will work for up to a hundred
attachments max.

What does "a hundred attachments max" mean? Can't I
trace thousands of kernel functions with a bpf program of
tracing multi-link?

I mean: how long does it take to attach one program
to 100 fentry-s?
What is the time for 1k and for 10k?

The kprobe multi test attaches to pretty much all funcs in
/sys/kernel/tracing/available_filter_functions
and it's fast enough to run in test_progs on every commit in bpf CI.
See get_syms() in prog_tests/kprobe_multi_test.c

Can this new multi fentry do that?
And at what speed?
The answer will decide how applicable this api is going to be.
Generating different trampolines for every attach point
is an approach as well. Pls benchmark it too.
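
A rough way to collect the numbers asked for above is to time a batch of attach calls with a monotonic clock. The sketch below is only a userspace harness: `attach_fn`, `time_attach` and `count_attach` are hypothetical names, and `count_attach` is a stub that merely counts calls. A real measurement would assume the callback wraps whatever attach API the patchset ends up exposing.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical callback: attach to target number idx, return 0 on success. */
typedef int (*attach_fn)(size_t idx, void *ctx);

/* Stub used for illustration only: counts how often it was called. */
int count_attach(size_t idx, void *ctx)
{
	(void)idx;
	(*(size_t *)ctx)++;
	return 0;
}

/* Time n attach calls; returns elapsed nanoseconds, or -1 on failure. */
int64_t time_attach(attach_fn attach, void *ctx, size_t n)
{
	struct timespec start, end;

	if (clock_gettime(CLOCK_MONOTONIC, &start))
		return -1;
	for (size_t i = 0; i < n; i++) {
		if (attach(i, ctx))
			return -1;
	}
	if (clock_gettime(CLOCK_MONOTONIC, &end))
		return -1;

	return (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL +
	       (int64_t)(end.tv_nsec - start.tv_nsec);
}
```

Running it with n = 100, 1000 and 10000 and comparing the totals would show whether attach time grows linearly with the number of targets.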

Let's step back.
What is the exact use case you're trying to solve?
Not an artificial one as selftest in patch 9, but the real use case?

I have a tool, which is used to diagnose network problems,
and its name is "nettrace". It will trace many kernel functions, whose
function args contain "skb", like this:

./nettrace -p icmp
begin trace...
***************** ffff889be8fbd500,ffff889be8fbcd00 ***************
[1272349.614564] [dev_gro_receive     ] ICMP: 169.254.128.15 -> 172.27.0.6 ping request, seq: 48220
[1272349.614579] [__netif_receive_skb_core] ICMP: 169.254.128.15 -> 172.27.0.6 ping request, seq: 48220
[1272349.614585] [ip_rcv              ] ICMP: 169.254.128.15 -> 172.27.0.6 ping request, seq: 48220
[1272349.614592] [ip_rcv_core         ] ICMP: 169.254.128.15 -> 172.27.0.6 ping request, seq: 48220
[1272349.614599] [skb_clone           ] ICMP: 169.254.128.15 -> 172.27.0.6 ping request, seq: 48220
[1272349.614616] [nf_hook_slow        ] ICMP: 169.254.128.15 -> 172.27.0.6 ping request, seq: 48220
[1272349.614629] [nft_do_chain        ] ICMP: 169.254.128.15 -> 172.27.0.6 ping request, seq: 48220
[1272349.614635] [ip_rcv_finish       ] ICMP: 169.254.128.15 -> 172.27.0.6 ping request, seq: 48220
[1272349.614643] [ip_route_input_slow ] ICMP: 169.254.128.15 -> 172.27.0.6 ping request, seq: 48220
[1272349.614647] [fib_validate_source ] ICMP: 169.254.128.15 -> 172.27.0.6 ping request, seq: 48220
[1272349.614652] [ip_local_deliver    ] ICMP: 169.254.128.15 -> 172.27.0.6 ping request, seq: 48220
[1272349.614658] [nf_hook_slow        ] ICMP: 169.254.128.15 -> 172.27.0.6 ping request, seq: 48220
[1272349.614663] [ip_local_deliver_finish] ICMP: 169.254.128.15 -> 172.27.0.6 ping request, seq: 48220
[1272349.614666] [icmp_rcv            ] ICMP: 169.254.128.15 -> 172.27.0.6 ping request, seq: 48220
[1272349.614671] [icmp_echo           ] ICMP: 169.254.128.15 -> 172.27.0.6 ping request, seq: 48220
[1272349.614675] [icmp_reply          ] ICMP: 169.254.128.15 -> 172.27.0.6 ping request, seq: 48220
[1272349.614715] [consume_skb         ] ICMP: 169.254.128.15 -> 172.27.0.6 ping request, seq: 48220
[1272349.614722] [packet_rcv          ] ICMP: 169.254.128.15 -> 172.27.0.6 ping request, seq: 48220
[1272349.614725] [consume_skb         ] ICMP: 169.254.128.15 -> 172.27.0.6 ping request, seq: 48220

For now, I have to create a bpf program for every kernel
function that I want to trace, which is up to 200.

With this multi-link, I only need to create 5 bpf programs,
like this:

int BPF_PROG(trace_skb_1, struct sk_buff *skb);
int BPF_PROG(trace_skb_2, u64 arg0, struct sk_buff *skb);
int BPF_PROG(trace_skb_3, u64 arg0, u64 arg1, struct sk_buff *skb);
int BPF_PROG(trace_skb_4, u64 arg0, u64 arg1, u64 arg2, struct sk_buff *skb);
int BPF_PROG(trace_skb_5, u64 arg0, u64 arg1, u64 arg2, u64 arg3, struct sk_buff *skb);

Then, I can attach trace_skb_1 to all the kernel functions that
I want to trace and whose first arg is skb; attach trace_skb_2 to kernel
functions whose 2nd arg is skb, etc.

Or, I can create only one bpf program and store the index
of skb to the attachment cookie, and attach this program to all
the kernel functions that I want to trace.

This is my use case. With the multi-link, I now have only
1 bpf program, 1 bpf link and 200 trampolines, instead of 200
bpf programs, 200 bpf links and 200 trampolines.
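
The single-program variant, where the attachment cookie carries the position of the skb argument, could look roughly like the sketch below. This is a plain userspace C model rather than BPF code: `pick_skb_arg` and `MAX_ARGS` are hypothetical names, the `args` array stands in for the traced function's raw argument values, and the cookie is assumed to hold the zero-based index of the skb. In a real tracing program the cookie would presumably be read with `bpf_get_attach_cookie()`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical upper bound on the arguments visible to the program. */
#define MAX_ARGS 6

/*
 * Return the raw value of the argument selected by the attach cookie,
 * or 0 if the cookie does not name a valid argument of this function.
 */
uint64_t pick_skb_arg(const uint64_t *args, size_t nargs, uint64_t cookie)
{
	if (cookie >= nargs || cookie >= MAX_ARGS)
		return 0;
	return args[cookie];
}
```

With this shape, the per-function difference lives entirely in the cookie, so one program body can serve every attach point.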

I see. The use case makes sense to me.
Andrii's retsnoop was used to do a similar thing before kprobe multi was
introduced.

The shared trampoline you mentioned seems to be a
wonderful idea, which can reduce the 200 trampolines
to one. Let me have a look: we create a trampoline and
record the max args count of all the target functions; let's
call it arg_count.

When generating the trampoline, we assume that the
function args count is arg_count. When attaching, we
check the consistency of all the target functions, just like
what we do now.
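
As a rough illustration of the two steps above, the sketch below computes arg_count as the maximum over all targets and then checks that the program only reads arguments every target actually has. The names `max_arg_count` and `targets_consistent` are hypothetical, the accessed arguments are assumed to be recorded as a bitmask, and the sketch checks argument presence only; it ignores argument types, which a real check would also have to compare.

```c
#include <stddef.h>
#include <stdint.h>

/* Largest argument count among all target functions (arg_count). */
unsigned int max_arg_count(const unsigned int *nargs, size_t ntargets)
{
	unsigned int max = 0;

	for (size_t i = 0; i < ntargets; i++) {
		if (nargs[i] > max)
			max = nargs[i];
	}
	return max;
}

/*
 * accessed is a bitmask of the argument indexes the program reads
 * (bit 0 = first argument). Return 1 if every target has all of the
 * accessed arguments, 0 otherwise.
 */
int targets_consistent(uint32_t accessed, const unsigned int *nargs,
		       size_t ntargets)
{
	for (size_t i = 0; i < ntargets; i++) {
		uint32_t valid = nargs[i] >= 32 ? UINT32_MAX
						: ((1u << nargs[i]) - 1);

		if (accessed & ~valid)
			return 0;
	}
	return 1;
}
```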

For one trampoline to handle all attach points we might
need some arch support, but we can start simple.
Make btf_func_model with MAX_BPF_FUNC_REG_ARGS
by calling btf_distill_func_proto() with func==NULL.
And use that to build a trampoline.

The challenge is how to use a minimal number of trampolines
when bpf_progA is attached to func1, func2, func3
and bpf_progB is attached to func3, func4, func5.
We'd still need 3 trampolines:
for func[12] to call bpf_progA,
for func3 to call bpf_progA and bpf_progB,
for func[45] to call bpf_progB.
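
The grouping above can be modeled by giving each function a bitmask of the programs attached to it: functions with the same mask can share a trampoline, so the number of trampolines is the number of distinct non-zero masks. The sketch below is only a model of that counting argument, with the hypothetical name `count_trampolines`; it assumes that functions in one group are otherwise compatible with a shared trampoline.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * progs[i] is a bitmask of the programs attached to function i
 * (bit 0 = bpf_progA, bit 1 = bpf_progB, ...). Count the distinct
 * non-zero masks, i.e. the number of trampolines needed when
 * functions with an identical program set share one trampoline.
 */
size_t count_trampolines(const uint32_t *progs, size_t nfuncs)
{
	size_t count = 0;

	for (size_t i = 0; i < nfuncs; i++) {
		int seen = 0;

		if (!progs[i])
			continue;	/* nothing attached to this function */
		for (size_t j = 0; j < i; j++) {
			if (progs[j] == progs[i]) {
				seen = 1;
				break;
			}
		}
		if (!seen)
			count++;
	}
	return count;
}
```

For the example in the text, the masks for func1..func5 would be {A, A, A|B, B, B}, which gives three groups.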

Jiri was trying to solve it in the past. His slides from LPC:
https://lpc.events/event/16/contributions/1350/attachments/1033/1983/plumbers.pdf

Pls study them and his prior patchsets to avoid stepping on the same rakes.

yep, I refrained from commenting not to take you down the same path
I did, but if you insist.. ;-) 

I managed to forget almost all of it, but IIRC the main pain point
was that at some point I had to split an existing trampoline, which caused
the whole trampoline management and error paths to become a mess

I tried to explain things in [1] changelog and the latest patchset is in [0]

feel free to use/take anything, but I strongly advise against it ;-)
please let me know if I can help

jirka


[0] https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/log/?h=bpf/batch
[1] https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=bpf/batch&id=52a1d4acdf55df41e99ca2cea51865e6821036ce
