Re: [RFC Patch net-next] net_sched: introduce eBPF based Qdisc
From: Martin KaFai Lau <hidden>
Date: 2021-09-01 17:45:52
Also in:
bpf
On Wed, Sep 01, 2021 at 12:42:03PM +0200, Toke Høiland-Jørgensen wrote:
John Fastabend [off-list ref] writes:quoted
Cong Wang wrote:quoted
On Tue, Aug 24, 2021 at 4:47 PM Martin KaFai Lau [off-list ref] wrote:quoted
Please explain more on this. What is currently missing to make qdisc in struct_ops possible?I think you misunderstand this point. The reason why I avoid it is _not_ anything is missing, quite oppositely, it is because it requires a lot of work to implement a Qdisc with struct_ops approach, literally all those struct Qdisc_ops (not to mention struct Qdisc_class_ops). WIth current approach, programmers only need to implement two eBPF programs (enqueue and dequeue).
_if_ it is using as a qdisc object/interface, the patch "looks" easier because it obscures some of the ops/interface from the bpf user. The user will eventually ask for more flexibility and then an on-par interface as the kernel's qdisc. If there are some common 'ops', the common bpf code can be shared as a library in userspace or there is also kfunc call to call into the kernel implementation. For existing kernel qdisc author, it will be easier to use the same interface also.
quoted
Another idea. Rather than work with qdisc objects which creates all these issues with how to work with existing interfaces, filters, etc. Why not create an sk_buff map? Then this can be used from the existing egress/ingress hooks independent of the actual qdisc being used.I agree. In fact, I'm working on doing just this for XDP, and I see no reason why the map type couldn't be reused for skbs as well. Doing it this way has a couple of benefits: - It leaves more flexibility to BPF: want a simple FIFO queue? just implement that with a single queue map. Or do you want to build a full hierarchical queueing structure? Just instantiate as many queue maps as you need to achieve this. Etc.
Agree. Regardless how the interface may look like, I even think being able to queue/dequeue an skb into different bpf maps should be the first thing to do here. Looking forward to your patches.
- The behaviour is defined entirely by BPF program behaviour, and does not require setting up a qdisc hierarchy in addition to writing BPF code.
Interesting idea. If it does not need to use the qdisc object/interface and be able to do the qdisc hierarchy setup in a programmable way, it may be nice. It will be useful for the future patches to come with some bpf prog examples to do that.
- It should be possible to structure the hooks in a way that allows reusing queueing algorithm implementations between the qdisc and XDP layers.quoted
You mention skb should not be exposed to userspace? Why? Whats the reason for this? Anyways we can make kernel only maps if we want or scrub the data before passing it to userspace. We do this already in some cases.Yup, that's my approach as well.quoted
IMO it seems cleaner and more general to allow sk_buffs to be stored in maps and pulled back out later for enqueue/dequeue.FWIW there's some gnarly details here (for instance, we need to make sure the BPF program doesn't leak packet references after they are dequeued from the map). My idea is to use a scheme similar to what we do for XDP_REDIRECT, where a helper sets some hidden variables and doesn't actually remove the packet from the queue until the BPF program exits (so the kernel can make sure things are accounted correctly).
The verifier is tracking the sk's references. Can it be reused to track the skb's reference?
quoted
I think one trick might be how to trigger the dequeue event on transition from stopped to running net_device or other events like this, but that could be solved with another program attached to those events to kick the dequeue logic.This is actually easy in the qdisc case, I think: there's already a qdisc_dequeue() operation, which just needs to execute a BPF program that picks which packet to dequeue (by pulling it off a queue map). For XDP we do need a new hook, on driver TX completion or something like that. Details TBD. Also, we need a way to BPF to kick an idle interface and make it start transmitting; that way we can implement a traffic shaper (that delays packets) by using BPF timers :) -Toke