Re: [RFC Patch net-next] net_sched: introduce eBPF based Qdisc
From: Toke Høiland-Jørgensen <hidden>
Date: 2021-09-02 22:28:04
Also in:
bpf
John Fastabend [off-list ref] writes:
Toke Høiland-Jørgensen wrote:quoted
Martin KaFai Lau [off-list ref] writes:quoted
On Wed, Sep 01, 2021 at 12:42:03PM +0200, Toke Høiland-Jørgensen wrote:quoted
John Fastabend [off-list ref] writes:quoted
Cong Wang wrote:quoted
On Tue, Aug 24, 2021 at 4:47 PM Martin KaFai Lau [off-list ref] wrote:quoted
Please explain more on this. What is currently missing to make qdisc in struct_ops possible?I think you misunderstand this point. The reason why I avoid it is _not_ anything is missing, quite oppositely, it is because it requires a lot of work to implement a Qdisc with struct_ops approach, literally all those struct Qdisc_ops (not to mention struct Qdisc_class_ops). WIth current approach, programmers only need to implement two eBPF programs (enqueue and dequeue)._if_ it is using as a qdisc object/interface, the patch "looks" easier because it obscures some of the ops/interface from the bpf user. The user will eventually ask for more flexibility and then an on-par interface as the kernel's qdisc. If there are some common 'ops', the common bpf code can be shared as a library in userspace or there is also kfunc call to call into the kernel implementation. For existing kernel qdisc author, it will be easier to use the same interface also.The question is if it's useful to provide the full struct_ops for qdiscs? Having it would allow a BPF program to implement that interface towards userspace (things like statistics, classes etc), but the question is if anyone is going to bother with that given the wealth of BPF-specific introspection tools already available?If its a map value then you get all the goodness with normal map inspection.
Yup, exactly, so why bother with struct_ops to implement all the other qdisc ops (apart from enqueue/dequeue)?
quoted
My hope is that we can (longer term) develop some higher-level tools to express queueing policies that can then generate the BPF code needed to implement them. Or as a start just some libraries to make this easier, which I think is also what you're hinting at here? :)The P4 working group has thought about QOS and queuing from P4 side if you want to think in terms of a DSL. Might be interesting and have some benefits if you want to drop into hardware offload side. For example compile to XDP for fast CPU architectures, Altera/Xilinx backend for FPGA or switch silicon for others. This was always the dream on my side maybe we've finally got close to actualizing it, 10 years later ;)
Yup, would love to see this! Let's just hope it doesn't take another decade ;)
quoted
quoted
quoted
quoted
Another idea. Rather than work with qdisc objects which creates all these issues with how to work with existing interfaces, filters, etc. Why not create an sk_buff map? Then this can be used from the existing egress/ingress hooks independent of the actual qdisc being used.I agree. In fact, I'm working on doing just this for XDP, and I see no reason why the map type couldn't be reused for skbs as well. Doing it this way has a couple of benefits: - It leaves more flexibility to BPF: want a simple FIFO queue? just implement that with a single queue map. Or do you want to build a full hierarchical queueing structure? Just instantiate as many queue maps as you need to achieve this. Etc.Agree. Regardless how the interface may look like, I even think being able to queue/dequeue an skb into different bpf maps should be the first thing to do here. Looking forward to your patches.Thanks! Guess I should go work on them, then :DHappy to review any RFCs.quoted
quoted
quoted
- The behaviour is defined entirely by BPF program behaviour, and does not require setting up a qdisc hierarchy in addition to writing BPF code.Interesting idea. If it does not need to use the qdisc object/interface and be able to do the qdisc hierarchy setup in a programmable way, it may be nice. It will be useful for the future patches to come with some bpf prog examples to do that.Absolutely; we plan to include example algorithm implementations as well!A weighted round robin queue setup might be a useful example and easy to implement/understand, but slightly more interesting than a pfifo. Also would force understanding multiple cpus and timer issues.
Yup, some sort of RR queueing is definitely on the list!
quoted
quoted
quoted
- It should be possible to structure the hooks in a way that allows reusing queueing algorithm implementations between the qdisc and XDP layers.quoted
You mention skb should not be exposed to userspace? Why? Whats the reason for this? Anyways we can make kernel only maps if we want or scrub the data before passing it to userspace. We do this already in some cases.Yup, that's my approach as well.Having something reported back to userspace as the value might be helpful for debugging/tracing. Maybe the skb->hash? Then you could set this and then track a skb through the stack even when its in a bpf skb queue.
Yeah. I've just been using the pointer value for my initial testing. That's not a good solution, of course, but having a visible identifier would be neat. skb->hash makes sense for the qdisc layer, but not for XDP...
quoted
quoted
quoted
quoted
IMO it seems cleaner and more general to allow sk_buffs to be stored in maps and pulled back out later for enqueue/dequeue.FWIW there's some gnarly details here (for instance, we need to make sure the BPF program doesn't leak packet references after they are dequeued from the map). My idea is to use a scheme similar to what we do for XDP_REDIRECT, where a helper sets some hidden variables and doesn't actually remove the packet from the queue until the BPF program exits (so the kernel can make sure things are accounted correctly).The verifier is tracking the sk's references. Can it be reused to track the skb's reference?I was vaguely aware that it does this, but have not looked at the details. Would be great if this was possible; will see how far I get with it, and iterate from there (with your help, hopefully :))Also might need to drop any socket references from the networking side so an enqueued sock can't hold a socket open.
Not sure I'm following you here? -Toke