Thread: 42 messages, 5 authors, 2021-08-11

Re: [RFC Patch bpf-next] bpf: introduce bpf timer

From: Song Liu <hidden>
Date: 2021-04-02 19:45:25
Also in: bpf

On Apr 2, 2021, at 12:08 PM, Cong Wang [off-list ref] wrote:

On Fri, Apr 2, 2021 at 10:57 AM Song Liu [off-list ref] wrote:

On Apr 2, 2021, at 10:34 AM, Cong Wang [off-list ref] wrote:

On Thu, Apr 1, 2021 at 1:17 PM Song Liu [off-list ref] wrote:

On Apr 1, 2021, at 10:28 AM, Cong Wang [off-list ref] wrote:

On Wed, Mar 31, 2021 at 11:38 PM Song Liu [off-list ref] wrote:

On Mar 31, 2021, at 9:26 PM, Cong Wang [off-list ref] wrote:

From: Cong Wang <redacted>

(This patch is still in early stage and obviously incomplete. I am sending
it out to get some high-level feedback. Please kindly ignore any coding
details for now and focus on the design.)
Could you please explain the use case of the timer? Is it the same as
the earlier proposal of BPF_MAP_TYPE_TIMEOUT_HASH?

Assuming that is the case, I guess the use case is to assign an expire
time for each element in a hash map, and to periodically remove expired
elements from the map.
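
The use case described above can be illustrated with a minimal user-space sketch in plain C. It is not the patch's implementation and does not use a BPF map: the struct entry layout and the gc_expired() helper are hypothetical names chosen only to show the idea of a per-element expire time plus a periodic scan.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry layout; the names are illustrative, not from the patch. */
struct entry {
	uint32_t key;
	uint64_t expires_ns;	/* absolute expiry time */
	int in_use;		/* 1 if the slot holds a live element */
};

/* Drop every live entry whose expiry time has passed; return how many were removed. */
static size_t gc_expired(struct entry *tbl, size_t n, uint64_t now_ns)
{
	size_t removed = 0;

	assert(tbl != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].in_use && tbl[i].expires_ns <= now_ns) {
			tbl[i].in_use = 0;
			removed++;
		}
	}
	return removed;
}
```

Whatever drives the scan (a kernel timer or a user-space one) would call something like gc_expired() once per period; the rest of the thread is about where that driver should live.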

If this is still correct, my next question is: how does this compare
against a user space timer? Will the user space timer be too slow?
Yes, as I explained in the timeout hashmap patchset, doing it in user-space
would require a lot of syscalls (without batching) or copying (with batching).
I will add the explanation here, in case people miss why we need a timer.
How about we use a user space timer to trigger a BPF program (e.g. use
BPF_PROG_TEST_RUN on a raw_tp program); then, in the BPF program, we can
use bpf_for_each_map_elem and bpf_map_delete_elem to scan and update the
map? With this approach, we only need one syscall per period.
Interesting, I didn't know we can explicitly trigger a BPF program running
from user-space. Is it for testing purposes only?
This is not only for testing. We will use this in perf (starting in 5.13).

/* currently in Arnaldo's tree, tools/perf/util/bpf_counter.c: */

/* trigger the leader program on a cpu */
static int bperf_trigger_reading(int prog_fd, int cpu)
{
       DECLARE_LIBBPF_OPTS(bpf_test_run_opts, opts,
                           .ctx_in = NULL,
                           .ctx_size_in = 0,
                           .flags = BPF_F_TEST_RUN_ON_CPU,
                           .cpu = cpu,
                           .retval = 0,
               );

       return bpf_prog_test_run_opts(prog_fd, &opts);
}

test_run also passes the return value (retval) back to user space, so we can
adjust the timer interval based on retval.
This is really odd, every name here contains a "test" but it is not for testing
purposes. You probably need to rename/alias it. ;)

So, with this we have to get a user-space daemon running just to keep
this "timer" alive. If I want to run it every 1ms, it means I have to issue
a syscall BPF_PROG_TEST_RUN every 1ms. Even with a timer fd, we
still need poll() and timerfd_settime(). This is a considerable overhead
for just a single timer.
sys_bpf() takes about 0.5us. I would expect poll() and timerfd_settime() to 
be slightly faster. So the overhead is less than 0.2% of a single core 
(0.5us x 3 / 1ms). Do we need many counters for conntrack?
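
The estimate above can be checked with a few lines of C. The helper name syscall_overhead_pct() is hypothetical, and the 0.5us cost per call is the figure quoted in this message, not a fresh measurement.

```c
#include <assert.h>

/* Percentage of one core spent on `calls` syscalls of `cost_us` each, once per period. */
static double syscall_overhead_pct(double cost_us, int calls, double period_us)
{
	assert(period_us > 0.0);
	return cost_us * calls / period_us * 100.0;
}
```

With three calls of 0.5us each in a 1000us period, this gives 1.5us of syscall time per millisecond, or about 0.15% of one core, which is consistent with the "less than 0.2%" bound.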
With the current design, user-space can just exit after installing the timer;
either the timer can adjust itself or other eBPF code can adjust it, so the
per-timer overhead is the same as a kernel timer.
I guess we still need to hold an fd to the prog/map? Alternatively, we can
pin the prog/map, but then the user needs to clean it up.
The visibility to other BPF code is important for the conntrack case,
because each time we get an expired item during a lookup, we may
want to schedule the GC timer to run sooner. At least this would give
users more freedom to decide when to reschedule the timer.
Do we plan to share the timer program among multiple processes (which can 
start and terminate in arbitrary orders)? If that is the case, I can imagine
a timer program is better than a user space timer. 

Thanks,
Song 