Thread (58 messages), 10 authors, 2020-02-05

Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path

From: Toke Høiland-Jørgensen <hidden>
Date: 2020-02-04 11:00:50

Jesper Dangaard Brouer [off-list ref] writes:
On Mon, 03 Feb 2020 21:13:24 +0100
Toke Høiland-Jørgensen [off-list ref] wrote:
Oops, I see I forgot to reply to this bit:
Yeah, but having the low-level details available to the XDP program
(such as HW queue occupancy for the egress hook) is one of the benefits
of XDP, isn't it?  
I think I glossed over the hope for having access to HW queue occupancy
- what exactly are you after? 

I don't think one can get anything beyond a BQL type granularity.
Reading over PCIe is out of the question, and device write-back at high
granularity would burn through way too much bus throughput.
This was Jesper's idea originally, so maybe he can explain better; but
as I understood it, he basically wanted to expose the same information
that BQL has to eBPF. Making it possible for an eBPF program to either
(re-)implement BQL with its own custom policy, or react to HWQ pressure
in some other way, such as by load balancing to another interface.
Yes, and I also have plans that go beyond BQL. But let me start by
explaining the BQL part, and answer Toke's question below.

On Mon, 03 Feb 2020 20:56:03 +0100 Toke wrote:
[...] Hmm, I wonder if a TX driver hook is enough?
Short answer is no, a TX driver hook is not enough.  The queue state
info the TX driver hook has access to needs to be updated once the
hardware has "confirmed" that the TX-DMA operation has completed.  For
BQL/DQL this update happens during TX-DMA completion/cleanup (in the
code, see the call sites for netdev_tx_completed_queue()).  (As Jakub
wisely points out, we cannot query the device directly due to
performance implications.)  It doesn't need to be a new BPF hook, just
something that updates the queue state info (we could piggyback on the
netdev_tx_completed_queue() call or give the TX hook access to
dev_queue->dql).
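
To make the accounting described above concrete, here is a minimal userspace sketch, not kernel code. It assumes a much-simplified per-queue state: the struct txq_state type and the txq_*() functions are hypothetical names made up for illustration, and the real struct dql carries more fields and an adaptive limit. The point it shows is only that the "in flight" number an egress hook would read is correct only if something updates it on the TX-DMA completion path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified per-TX-queue state, loosely modelled on the
 * idea behind BQL/DQL. This is NOT the kernel's struct dql. */
struct txq_state {
	uint32_t limit;         /* max bytes allowed in flight */
	uint32_t num_queued;    /* total bytes handed to hardware (may wrap) */
	uint32_t num_completed; /* total bytes confirmed by hardware (may wrap) */
};

/* Bytes handed to the hardware that it has not yet confirmed. */
static uint32_t txq_inflight(const struct txq_state *q)
{
	/* Unsigned subtraction stays correct across counter wrap-around. */
	return q->num_queued - q->num_completed;
}

/* What an egress hook could consult before transmitting. */
static bool txq_has_room(const struct txq_state *q, uint32_t bytes)
{
	return (uint64_t)txq_inflight(q) + bytes <= q->limit;
}

/* Transmit path: account for bytes handed to the NIC. */
static void txq_sent(struct txq_state *q, uint32_t bytes)
{
	q->num_queued += bytes;
}

/* TX-DMA completion path: the hardware confirmed these bytes. Without
 * this update the hook would only ever see the queue grow. */
static void txq_completed(struct txq_state *q, uint32_t bytes)
{
	assert(bytes <= txq_inflight(q));
	q->num_completed += bytes;
}
```

In this sketch txq_sent() and txq_completed() stand in for the places where a driver calls netdev_tx_sent_queue() and netdev_tx_completed_queue(); only the second one makes room again.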
The question is whether this can't simply be done through bpf helpers?
bpf_get_txq_occupancy(ifindex, txqno)?
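
As a rough illustration of what such a helper could enable, the following is a userspace mock, not a BPF program. bpf_get_txq_occupancy() does not exist in the kernel; its return type, units (bytes in flight) and error value are assumptions here, as are pick_egress(), the mock_occupancy table and the use of ifindex as a plain array index. It sketches the "load balance to another interface" reaction mentioned earlier in the thread.

```c
#include <stdint.h>

#define MAX_IFACES 4
#define MAX_TXQS   8

/* Stand-in for kernel-side queue state; in a real design this would be
 * maintained on the TX completion path, not by the program itself. */
static uint32_t mock_occupancy[MAX_IFACES][MAX_TXQS];

/* Userspace mock of the helper proposed above. The signature is an
 * assumption: returns bytes in flight, or -1 for an unknown queue. */
static int64_t bpf_get_txq_occupancy(uint32_t ifindex, uint32_t txqno)
{
	if (ifindex >= MAX_IFACES || txqno >= MAX_TXQS)
		return -1;
	return mock_occupancy[ifindex][txqno];
}

/* Stay on the primary interface unless its queue holds more than
 * 'threshold' bytes and the backup queue is less loaded. */
static uint32_t pick_egress(uint32_t primary, uint32_t backup,
			    uint32_t txqno, uint32_t threshold)
{
	int64_t p = bpf_get_txq_occupancy(primary, txqno);
	int64_t b = bpf_get_txq_occupancy(backup, txqno);

	if (p < 0)
		return backup;   /* primary queue unknown */
	if (b < 0)
		return primary;  /* backup queue unknown */
	return (p > threshold && b < p) ? backup : primary;
}
```

For example, with 9000 bytes in flight on interface 1 and 1000 on interface 2, pick_egress(1, 2, 0, 4000) selects interface 2; once interface 1 drains below the threshold it is chosen again.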
Regarding "where is the queue": For me the XDP-TX queue is the NIC
hardware queue, which this BPF hook has some visibility into and can do
filtering on. (Imagine that my TX queue is bandwidth limited; then I
can shrink the packet size and still send a "congestion" packet to my
receiver).
I'm not sure the hardware queues will be enough, though. Unless I'm
misunderstanding something, hardware queues are (1) fairly short and (2)
FIFO. So, say we wanted to implement fq_codel for XDP forwarding: we'd
still need a software queueing layer on top of the hardware queue.
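
A small userspace sketch of that layering, under simplifying assumptions: per-flow software queues are drained in plain round-robin order into a deliberately short hardware FIFO. This is not fq_codel (no deficit counters, no CoDel drop logic), and struct fifo, struct sched and sched_fill_hw() are hypothetical names. It only illustrates that the fairness decision has to be made before packets enter the FIFO.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NFLOWS   4  /* software per-flow queues */
#define FLOW_CAP 8  /* packets per flow queue */
#define HW_CAP   2  /* deliberately short hardware FIFO */

static_assert(HW_CAP <= FLOW_CAP, "hw FIFO reuses the flow-sized buffer");

struct fifo {
	uint32_t pkt[FLOW_CAP];   /* packet ids */
	unsigned head, len, cap;
};

static bool fifo_push(struct fifo *f, uint32_t id)
{
	if (f->len == f->cap)
		return false;
	f->pkt[(f->head + f->len) % f->cap] = id;
	f->len++;
	return true;
}

static bool fifo_pop(struct fifo *f, uint32_t *id)
{
	if (f->len == 0)
		return false;
	*id = f->pkt[f->head];
	f->head = (f->head + 1) % f->cap;
	f->len--;
	return true;
}

struct sched {
	struct fifo flow[NFLOWS];
	struct fifo hw;
	unsigned next;            /* round-robin cursor */
};

static void sched_init(struct sched *s)
{
	for (unsigned i = 0; i < NFLOWS; i++)
		s->flow[i] = (struct fifo){ .cap = FLOW_CAP };
	s->hw = (struct fifo){ .cap = HW_CAP };
	s->next = 0;
}

/* Move one packet per flow, round-robin, into the hardware FIFO until it
 * is full or every flow queue is empty. Returns the number moved. */
static unsigned sched_fill_hw(struct sched *s)
{
	unsigned moved = 0, idle = 0;
	uint32_t id;

	while (s->hw.len < s->hw.cap && idle < NFLOWS) {
		struct fifo *f = &s->flow[s->next];

		s->next = (s->next + 1) % NFLOWS;
		if (fifo_pop(f, &id)) {
			fifo_push(&s->hw, id);  /* cannot fail: room checked above */
			moved++;
			idle = 0;
		} else {
			idle++;
		}
	}
	return moved;
}
```

With a bulk flow holding packets 1, 2, 3 and a sparse flow holding packet 10, the first fill puts 1 and 10 into the hardware FIFO, so the sparse flow is not stuck behind the bulk flow's backlog. Had all four packets been pushed straight into a FIFO, the order would simply be arrival order.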

If the hardware is EDT-aware this may change, I suppose, but I'm not
sure if we can design the XDP queueing primitives with this assumption? :)
The bigger picture is that I envision the XDP-TX/egress hook can
open up for taking advantage of NIC hardware TX queue features. This
also ties into the queue abstraction work by Björn+Magnus. Today NIC
hardware can do a million TX-queues, and hardware can also do rate
limiting per queue. Thus, I also envision that the XDP-TX/egress hook
can choose/change the TX queue the packet is queued/sent on (we can
likely just overload the XDP_REDIRECT and have a new bpf map type for
this).
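
One way to picture the map-based queue selection is the following userspace sketch. No such bpf map type exists in the thread's patch set; struct txq_map, the txq_map_*() functions and the key-modulo-size slotting are all assumptions chosen to keep the example short. The control plane updates the map, and the egress hook only does a lookup.

```c
#include <stdint.h>

#define TXQ_MAP_SIZE 16
#define TXQ_UNSET    UINT32_MAX

/* Hypothetical stand-in for "a new bpf map type": slot -> TX queue. */
struct txq_map {
	uint32_t txq[TXQ_MAP_SIZE];
};

static void txq_map_init(struct txq_map *m)
{
	for (unsigned i = 0; i < TXQ_MAP_SIZE; i++)
		m->txq[i] = TXQ_UNSET;
}

/* Control-plane side: steer a traffic class (identified by 'key', for
 * example a flow hash) to a TX queue, e.g. a rate-limited one. */
static void txq_map_update(struct txq_map *m, uint32_t key, uint32_t txq)
{
	m->txq[key % TXQ_MAP_SIZE] = txq;
}

/* Egress-hook side: the queue for this packet, or default_txq when no
 * entry has been installed for its slot. */
static uint32_t txq_map_select(const struct txq_map *m, uint32_t key,
			       uint32_t default_txq)
{
	uint32_t txq = m->txq[key % TXQ_MAP_SIZE];

	return txq == TXQ_UNSET ? default_txq : txq;
}
```

Keys that collide modulo the table size share a queue in this sketch; a real map type would need a deliberate policy for that.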
Yes, I think I mentioned in another email that putting all the queueing
smarts into the redirect map was also something I'd considered (well, I
do think we've discussed this in the past, so maybe not so surprising if
we're thinking along the same lines) :)

But the implication of this is also that an actual TX hook in the driver
need not necessarily incorporate a lot of new functionality, as it can
control the queueing through a combination of BPF helpers and map
updates?

-Toke