Re: [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer
From: Alexei Starovoitov <hidden>
Date: 2021-06-16 16:52:58
Also in:
bpf
On Tue, Jun 15, 2021 at 10:54:40PM -0700, Andrii Nakryiko wrote:
It could be the case, of course. But let's try to think this through to the end before giving up. I think it's mostly because we are trying to be too clever with lockless synchronization.
imo your proposed code fits "too clever" too ;) Just a reminder that few emails ago you've argued about "obviously correct" approach, but now...
I had a feeling that bpf_timer_cb needs to take lock as well. But once
we add that, refcounting becomes simpler and more deterministic, IMO.
Here's what I have in mind. I keep only important parts of the code,
so it's not a complete implementation. Please take a look below, I
left a few comments here and there.
struct bpf_hrtimer {
struct hrtimer timer;
struct bpf_map *map;
void *value;
struct bpf_prog *prog;
void *callback_fn;
/* pointer to that lock in struct bpf_timer_kern
* so that we can access it from bpf_timer_cb()
*/
struct bpf_spin_lock *lock;
};
BPF_CALL_5(bpf_timer_init, struct bpf_timer_kern *, timer, int, flags,
struct bpf_map *, map)
{
struct bpf_hrtimer *t;
int ret = 0;
____bpf_spin_lock(&timer->lock);
t = timer->timer;
if (t) {
ret = -EBUSY;
goto out;
}
/* allocate hrtimer via map_kmalloc to use memcg accounting */
t = bpf_map_kmalloc_node(map, sizeof(*t), GFP_ATOMIC, NUMA_NO_NODE);
if (!t) {
ret = -ENOMEM;
goto out;
}
t->value = (void *)timer /* - offset of bpf_timer inside elem */;
t->map = map;
t->timer.function = bpf_timer_cb;
/* we'll init them in bpf_timer_start */
t->prog = NULL;
t->callback_fn = NULL;
hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
timer->timer = t;
out:
____bpf_spin_unlock(&timer->lock);
return ret;
}
BPF_CALL_2(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs,
void *, cb, struct bpf_prog *, prog)
{
struct bpf_hrtimer *t;
int ret = 0;
____bpf_spin_lock(&timer->lock);
t = timer->timer;
if (!t) {
ret = -EINVAL;
goto out;
}
/* doesn't matter what it returns, we just request cancellation */
hrtimer_try_to_cancel(&t->timer);
/* t->prog might not be the same as prog (!) */
if (prog != t->prog) {
/* callback hasn't yet dropped refcnt */
if (t->prog) /* if it's null bpf_timer_cb() is running and
will put it later */
bpf_prog_put(t->prog);
if (IS_ERR(bpf_prog_inc_not_zero(prog))) {
/* this will only happen if prog is still running (and
it's actually us),
* but it was already put to zero, e.g., by closing last FD,
* so there is no point in scheduling a new run
*/I have a bit of mind explosion here... everything will be alright.
t->prog = NULL;
t->callback_fn = NULL;
ret = -E_WE_ARE_SHUTTING_DOWN;
goto out;
}
} /* otherwise we keep existing refcnt on t->prog == prog */
/* potentially new combination of prog and cb */
t->prog = prog;
t->callback_fn = cb;
hrtimer_start(&t->timer, ns_to_ktime(nsecs), HRTIMER_MODE_REL_SOFT);
out:
____bpf_spin_unlock(&timer->lock);
return ret;
}
BPF_CALL_1(bpf_timer_cancel, struct bpf_timer_kern *, timer)
{
struct bpf_hrtimer *t;
int ret = 0;
____bpf_spin_lock(&timer->lock);
t = timer->timer;
if (!t) {
ret = -EINVAL;
goto out;
}
/* this part I still worry about due to possibility of cpu migration,
* we need to think if we should migrate_disable() in bpf_timer_cb()
* and bpf_timer_* helpers(), but that's a separate topic
*/
if (this_cpu_read(hrtimer_running) == t) {
ret = -EDEADLK;
goto out;
}
ret = hrtimer_cancel(&t->timer);
if (t->prog) {
/* bpf_timer_cb hasn't put it yet (and now won't) */
bpf_prog_put(t->prog);
t->prog = NULL;
t->callback_fn = NULL;
}
out:
____bpf_spin_unlock(&timer->lock);
return ret;
}
static enum hrtimer_restart bpf_timer_cb(struct hrtimer *timer)
{
struct bpf_hrtimer *t = container_of(timer, struct bpf_hrtimer, timer);
struct bpf_map *map = t->map;
struct bpf_prog *prog;
void *key, *callback_fn;
u32 idx;
int ret;
/* this is very IMPORTANT */
____bpf_spin_lock(t->lock);
prog = t->prog;
if (!prog) {
/* we were cancelled, prog is put already, exit early */
____bpf_spin_unlock(&timer->lock);
return HRTIMER_NORESTART;
}
callback_fn = t->callback_fn;
/* make sure bpf_timer_cancel/bpf_timer_start won't
bpf_prog_put our prog */
t->prog = NULL;
t->callback_fn = NULL;
____bpf_spin_unlock(t->lock);
/* at this point we "own" prog's refcnt decrement */
this_cpu_write(hrtimer_running, t);
...
ret = BPF_CAST_CALL(t->callback_fn)((u64)(long)map,
(u64)(long)key,
(u64)(long)value, 0, 0);
WARN_ON(ret != 0); /* Next patch disallows 1 in the verifier */
bpf_prog_put(prog); /* always correct and non-racy */
this_cpu_write(hrtimer_running, NULL);
return HRTIMER_NORESTART;
}
bpf_timer_cancel_and_free() is mostly the same with t->prog NULL check
as everywhere elseI haven't started detailed analysis of above proposal, but looks overly complicated on the first glance. Not saying it's bad or good. Just complexity and races are striking.
quoted
There is no need to complicate bpf_timer with crazy refcnting schemes. The user space can simply pin the program in bpffs. In the future we might introduce a self-pinning helper that would pin the program and create a file. Sort-of like syscall prog type would pin self. That would be explicit and clean api instead of obscure hacks inside bpf_timer*().Do I understand correctly that the alternative that you are proposing is to keep some linked list of all map_values across all maps in the system that have initialized bpf_hrtimer with that particular bpf_prog in them? And when bpf_prog is put to zero you'll go and destruct them all in a race-free way? I have a bit of a hard time imagining how that will be implemented exactly, so I might be overcomplicating that in my mind. Will be happy to see the working code.
Here is working code... Note how patch 1 is so much simpler without complicated refcnting. And how patch 2 removes for_each_map_element that was necessary earlier. Also note that link list approach is an optimization. Instead of keeping a link list the bpf_free_used_timers() could call a map specific op to iterate all elems and free timers with timer->prog == prog_going_away. That was my initial proposal couple month ago. link_list is purely an optimization instead of for_each_map_elem.
Attachments
- 0001-bpf-Cancel-and-free-timers-when-prog-is-going-away.patch [text/plain] 7318 bytes · preview
- 0002-bpf-Don-t-iterate-all-map-elements-anymore.patch [text/plain] 1771 bytes · preview