Re: [RFC Patch net-next] net_sched: introduce eBPF based Qdisc

From: Toke Høiland-Jørgensen <hidden>
Date: 2021-09-02 22:28:04
Also in: bpf

John Fastabend [off-list ref] writes:

Toke Høiland-Jørgensen wrote:

quoted

Martin KaFai Lau [off-list ref] writes:

quoted

On Wed, Sep 01, 2021 at 12:42:03PM +0200, Toke Høiland-Jørgensen wrote:

quoted

John Fastabend [off-list ref] writes:

quoted

Cong Wang wrote:

quoted

On Tue, Aug 24, 2021 at 4:47 PM Martin KaFai Lau [off-list ref] wrote:

quoted

Please explain more on this.  What is currently missing
to make qdisc in struct_ops possible?

I think you misunderstand this point. The reason why I avoid it is
_not_ anything is missing, quite oppositely, it is because it requires
a lot of work to implement a Qdisc with struct_ops approach, literally
all those struct Qdisc_ops (not to mention struct Qdisc_class_ops).
WIth current approach, programmers only need to implement two
eBPF programs (enqueue and dequeue).

_if_ it is using as a qdisc object/interface,
the patch "looks" easier because it obscures some of the ops/interface
from the bpf user.  The user will eventually ask for more flexibility
and then an on-par interface as the kernel's qdisc.  If there are some
common 'ops', the common bpf code can be shared as a library in userspace
or there is also kfunc call to call into the kernel implementation.
For existing kernel qdisc author,  it will be easier to use the same
interface also.

The question is if it's useful to provide the full struct_ops for
qdiscs? Having it would allow a BPF program to implement that interface
towards userspace (things like statistics, classes etc), but the
question is if anyone is going to bother with that given the wealth of
BPF-specific introspection tools already available?

If its a map value then you get all the goodness with normal map
inspection.

Yup, exactly, so why bother with struct_ops to implement all the other
qdisc ops (apart from enqueue/dequeue)?

quoted

My hope is that we can (longer term) develop some higher-level tools to
express queueing policies that can then generate the BPF code needed to
implement them. Or as a start just some libraries to make this easier,
which I think is also what you're hinting at here? :)

The P4 working group has thought about QOS and queuing from P4 side if
you want to think in terms of a DSL. Might be interesting and have
some benefits if you want to drop into hardware offload side. For example
compile to XDP for fast CPU architectures, Altera/Xilinx backend for FPGA or
switch silicon for others. This was always the dream on my side maybe
we've finally got close to actualizing it, 10 years later ;)

Yup, would love to see this! Let's just hope it doesn't take another
decade ;)

quoted

Another idea. Rather than work with qdisc objects which creates all
these issues with how to work with existing interfaces, filters, etc.
Why not create an sk_buff map? Then this can be used from the existing
egress/ingress hooks independent of the actual qdisc being used.

I agree. In fact, I'm working on doing just this for XDP, and I see no
reason why the map type couldn't be reused for skbs as well. Doing it
this way has a couple of benefits:

- It leaves more flexibility to BPF: want a simple FIFO queue? just
  implement that with a single queue map. Or do you want to build a full
  hierarchical queueing structure? Just instantiate as many queue maps
  as you need to achieve this. Etc.

Agree.  Regardless how the interface may look like,
I even think being able to queue/dequeue an skb into different bpf maps
should be the first thing to do here.  Looking forward to your patches.

Thanks! Guess I should go work on them, then :D

Happy to review any RFCs.

quoted

- The behaviour is defined entirely by BPF program behaviour, and does
  not require setting up a qdisc hierarchy in addition to writing BPF
  code.

Interesting idea.  If it does not need to use the qdisc object/interface
and be able to do the qdisc hierarchy setup in a programmable way, it may
be nice.  It will be useful for the future patches to come with some
bpf prog examples to do that.

Absolutely; we plan to include example algorithm implementations as well!

A weighted round robin queue setup might be a useful example and easy
to implement/understand, but slightly more interesting than a pfifo. Also
would force understanding multiple cpus and timer issues.

Yup, some sort of RR queueing is definitely on the list!

quoted

- It should be possible to structure the hooks in a way that allows
  reusing queueing algorithm implementations between the qdisc and XDP
  layers.

quoted

You mention skb should not be exposed to userspace? Why? Whats the
reason for this? Anyways we can make kernel only maps if we want or
scrub the data before passing it to userspace. We do this already in
some cases.

Yup, that's my approach as well.

Having something reported back to userspace as the value might be helpful
for debugging/tracing. Maybe the skb->hash? Then you could set this and
then track a skb through the stack even when its in a bpf skb queue.

Yeah. I've just been using the pointer value for my initial testing.
That's not a good solution, of course, but having a visible identifier
would be neat. skb->hash makes sense for the qdisc layer, but not for
XDP...

quoted

IMO it seems cleaner and more general to allow sk_buffs
to be stored in maps and pulled back out later for enqueue/dequeue.

FWIW there's some gnarly details here (for instance, we need to make
sure the BPF program doesn't leak packet references after they are
dequeued from the map). My idea is to use a scheme similar to what we do
for XDP_REDIRECT, where a helper sets some hidden variables and doesn't
actually remove the packet from the queue until the BPF program exits
(so the kernel can make sure things are accounted correctly).

The verifier is tracking the sk's references.  Can it be reused to
track the skb's reference?

I was vaguely aware that it does this, but have not looked at the
details. Would be great if this was possible; will see how far I get
with it, and iterate from there (with your help, hopefully :))

Also might need to drop any socket references from the networking side
so an enqueued sock can't hold a socket open.

Not sure I'm following you here?

-Toke

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help