Re: [RFC Patch bpf-next] bpf: introduce bpf timer

From: Cong Wang <hidden>
Date: 2021-04-02 21:25:04
Also in: bpf

On Fri, Apr 2, 2021 at 12:28 PM Alexei Starovoitov
[off-list ref] wrote:

On Wed, Mar 31, 2021 at 09:26:35PM -0700, Cong Wang wrote:

quoted

This patch introduces a bpf timer map and a syscall to create bpf timer
from user-space.

That will severely limit timer api usability.
I agree with Song here. If user space has to create it there is no reason
to introduce new sys_bpf command. Just do all timers in user space
and trigger bpf prog via bpf_prog_test_run cmd.

I think you misunderstand my point, the reason why the creation is done
in user-space is not I like it, it is because it looks impossible to
create timers
in kernel-space, hence I have to move it to user-space.

quoted

The reason why we have to use a map is because the lifetime of a timer,
without a map, we have to delete the timer before exiting the eBPF program,
this would significately limit its use cases. With a map, the timer can
stay as long as the map itself and can be actually updated via map update
API's too,

this part is correct.

quoted

where the key is the timer ID and the value is the timer expire
timer.

The timer ID is unnecessary. We cannot introduce new IDR for every new
bpf object. It doesn't scale.

The IDR is per map, not per timer.

quoted

Timer creation is not easy either. In order to prevent users creating a
timer but not adding it to a map, we have to enforce this in the API which
takes a map parameter and adds the new timer into the map in one shot.

Not quite true. The timer memory should be a part of the map otherwise
the timer life time is hard to track. But arming the timer and initializing
it with a callback doesn't need to be tied with allocation of timer memory.

quoted

And because timer is asynchronous, we can not just use its callback like
bpf_for_each_map_elem().

Not quite. We can do it the same way as bpf_for_each_map_elem() despite
being async.

Well, at least bpf_for_each_map_elem() can use stack variables etc.,
but timers can't. They are very different to me.

quoted

More importantly, we have to properly reference
count its struct bpf_prog too.

It's true that callback prog or subprog has to stay alive while timer
is alive.
Traditional maps can live past the time of the progs that use them.
Like bpf prog can load with a pointer to already created hash map.
Then prog can unload and hashmap will stay around just fine.
All maps are like this with the exception of prog_array.
The progs get deleted from the prog_array map when appropriate.
The same thing can work for maps with embedded timers.
For example the subprog/prog can to be deleted from the timer if
that prog is going away. Similar to ref/uref distinction we have for prog_array.

quoted

It seems impossible to do this either in
verifier or in JIT, so we have to make its callback code a separate eBPF
program and pass a program fd from user-space. Fortunately, timer callback
can still live in the same object file with the rest eBPF code and share
data too.

Here is a quick demo of the timer callback code:

static __u64
check_expired_elem(struct bpf_map *map, __u32 *key, __u64 *val,
                  int *data)
{
  u64 expires = *val;

  if (expires < bpf_jiffies64()) {
    bpf_map_delete_elem(map, key);
    *data++;
  }
  return 0;
}

SEC("timer")
u32 timer_callback(void)
{
  int count = 0;

  bpf_for_each_map_elem(&map, check_expired_elem, &count, 0);
  if (count)
     return 0; // not re-arm this timer
  else
     return 10; // reschedule this timeGr after 10 jiffies
}

As Song pointed out the exact same thing can be done with timers in user space
and user space triggering prog exec with bpf_prog_test_run.

Here is how more general timers might look like:
https://lore.kernel.org/bpf/20210310011905.ozz4xahpkqbfkkvd@ast-mbp.dhcp.thefacebook.com/ (local)

include/uapi/linux/bpf.h:
struct bpf_timer {
  u64 opaque;
};
The 'opaque' field contains a pointer to dynamically allocated struct timer_list and other data.

This is my initial design as we already discussed, it does not work,
please see below.

The prog would do:
struct map_elem {
    int stuff;
    struct bpf_timer timer;
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1);
    __type(key, int);
    __type(value, struct map_elem);
} hmap SEC(".maps");

static int timer_cb(struct map_elem *elem)
{
    if (whatever && elem->stuff)
        bpf_timer_mod(&elem->timer, new_expire);
}

int bpf_timer_test(...)
{
    struct map_elem *val;

    val = bpf_map_lookup_elem(&hmap, &key);
    if (val) {
        bpf_timer_init(&val->timer, timer_cb, flags);
        val->stuff = 123;
        bpf_timer_mod(&val->timer, expires);
    }
}

bpf_map_update_elem() either from bpf prog or from user space
allocates map element and zeros 8 byte space for the timer pointer.
bpf_timer_init() allocates timer_list and stores it into opaque if opaque == 0.
The validation of timer_cb() is done by the verifier.
bpf_map_delete_elem() either from bpf prog or from user space
does del_timer() if elem->opaque != 0.
If prog refers such hmap as above during prog free the kernel does
for_each_map_elem {if (elem->opaque) del_timer().}
I think that is the simplest way of prevent timers firing past the prog life time.
There could be other ways to solve it (like prog_array and ref/uref).

Pseudo code:
int bpf_timer_init(struct bpf_timer *timer, void *timer_cb, int flags)
{
  if (timer->opaque)
    return -EBUSY;
  t = alloc timer_list
  t->cb = timer_cb;
  t->..
  timer->opaque = (long)t;
}

int bpf_timer_mod(struct bpf_timer *timer, u64 expires)
{
  if (!time->opaque)
    return -EINVAL;
  t = (struct timer_list *)timer->opaque;
  mod_timer(t,..);
}

int bpf_timer_del(struct bpf_timer *timer)
{
  if (!time->opaque)
    return -EINVAL;
  t = (struct timer_list *)timer->opaque;
  del_timer(t);
}

The verifier would need to check that 8 bytes occupied by bpf_timer and not accessed
via load/store by the program. The same way it does it for bpf_spin_lock.

This does not work, because bpf_timer_del() has to be matched
with bpf_timer_init(), otherwise we would leak timer resources.
For example:

SEC("foo")
bad_ebpf_code()
{
  struct bpf_timer t;
  bpf_timer_init(&t, ...); // allocate a timer
  bpf_timer_mod(&t, ..);
  // end of BPF program
  // now the timer is leaked, no one will delete it
}

We can not enforce the matching in the verifier, because users would
have to call bpf_timer_del() before exiting, which is not what we want
either.

Thanks.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help