Thread (42 messages) 42 messages, 5 authors, 2021-08-11

Re: [RFC Patch bpf-next] bpf: introduce bpf timer

From: Cong Wang <hidden>
Date: 2021-05-09 05:37:56
Also in: bpf

On Tue, Apr 27, 2021 at 11:34 AM Alexei Starovoitov
[off-list ref] wrote:
On Tue, Apr 27, 2021 at 9:36 AM Cong Wang [off-list ref] wrote:
quoted
If we enforce this ownership, in case of conntrack the owner would be
the program which sees the connection first, which is pretty much
unpredictable. For example, if the ingress program sees a connection
first, it installs a timer for this connection, but the traffic is
bidirectional,
hence egress program needs this connection and its timer too, we
should not remove this timer when the ingress program is freed.
Sure. That's trivially achieved with pinning.
If users forget to do so, their ebpf program would crash the kernel,
right? But ebpf programs should never crash the kernel, right?
One can have an ingress prog that tailcalls into another prog
that arms the timer with one of its subprogs.
Egress prog can tailcall into the same prog as well.
The ingress and egress progs can be replaced one by one
or removed both together and middle prog can stay alive
if it's pinned in bpffs or held alive by FD.
This looks necessarily complex. Look at the overhead of using
a timer properly here:

1. pin timer callback program
2. a program to install timer
3. a program array contains the above program
4. a tail call into the above program array

Why not design a simpler solution?
quoted
From another point of view: maps and programs are both first-class
resources in eBPF, a timer is stored in a map and associated with a
program, so it is naturally a first-class resource too.
Not really. The timer abstraction is about data. It invokes the callback.
That callback is a part of the program. The lifetime of the timer object
and lifetime of the callback can be different.
Obviously the timer logic need to make sure that callback text is alive
when the timer is armed.
Only if the callback could reference struct bpf_prog... And even if it
could, how about users forgetting to do so? ebpf verifier has to reject
such cases.
Combining timer and callback concepts creates a messy abstraction.
In the normal kernel code one can have a timer in any kernel data
structure and callback in the kernel text or in the kernel module.
The code needs to make sure that the module won't go away while
the timer is armed. Same thing with bpf progs. The progs are safe
kernel modules. The timers are independent objects.
Kernel modules can take reference count of its own module very
easily, plus there is no verifier for kernel modules. I don't understand
why you want to make ebpf programs as close to kernel modules as
possible in this case.
quoted
quoted
quoted
quoted
Also if your colleagues have something to share they should be
posting to the mailing list. Right now you're acting as a broken phone
passing info back and forth and the knowledge gets lost.
Please ask your colleagues to participate online.
They are already in CC from the very beginning. And our use case is
public, it is Cilium conntrack:
https://github.com/cilium/cilium/blob/master/bpf/lib/conntrack.h

The entries of the code are:
https://github.com/cilium/cilium/blob/master/bpf/bpf_lxc.c

The maps for conntrack are:
https://github.com/cilium/cilium/blob/master/bpf/lib/conntrack_map.h
If that's the only goal then kernel timers are not needed.
cilium conntrack works well as-is.
We don't go back to why user-space cleanup is inefficient again,
do we? ;)
I remain unconvinced that cilium conntrack _needs_ timer apis.
It works fine in production and I don't hear any complaints
from cilium users. So 'user space cleanup inefficiencies' is
very subjective and cannot be the reason to add timer apis.
I am pretty sure I showed the original report to you when I sent
timeout hashmap patch, in case you forgot here it is again:
https://github.com/cilium/cilium/issues/5048

and let me quote the original report here:

"The current implementation (as of v1.2) for managing the contents of
the datapath connection tracking map leaves something to be desired:
Once per minute, the userspace cilium-agent makes a series of calls to
the bpf() syscall to fetch all of the entries in the map to determine
whether they should be deleted. For each entry in the map, 2-3 calls
must be made: One to fetch the next key, one to fetch the value, and
perhaps one to delete the entry. The maximum size of the map is 1
million entries, and if the current count approaches this size then
the garbage collection goroutine may spend a significant number of CPU
cycles iterating and deleting elements from the conntrack map."

(Adding Joe in Cc too.)

Thanks.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help