
Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction

From: Simon Schippers <hidden>
Date: 2026-05-19 20:57:48
Also in: bpf, lkml

On 5/12/26 23:55, Simon Schippers wrote:
On 5/12/26 15:54, Jesper Dangaard Brouer wrote:
Nope, I'm using a bpftrace program to keep track of the inflight/limit
in a BPF hashmap.  Reading from /sys will not be accurate.
Ah nice.
Add the option --hist to have both NAPI and BQL histograms printed when
script ends.  This will give you an accurate pattern of how inflight and
limit evolves.
I moved the selftests into a github repo [1] to allow us to collaborate
and evaluate the changes more easily.  I explicitly kept the new BPF
based BQL tracking as a commit[2] for your benefit.

  [1] https://github.com/netoptimizer/veth-backpressure-performance-testing/tree/main/selftests

  [2] https://github.com/netoptimizer/veth-backpressure-performance-testing/commit/f25c5dc92977
Thanks for sharing. After minor issues I was able to set it up
(currently I am just using plain v5, will look at the coalescing patch
when I find the time):

Can confirm the latency reduction with the default settings, in my case
4.888ms to 0.241ms.

With the same script I was also able to see a performance slowdown:
veth_bql_test_virtme.sh --qdisc fq_codel --nrules 0
--> ~510 Kpps
Same with --bql-disable
--> ~570 Kpps
--> 12% faster
Thanks for running these benchmarks.

Notice that --nrules 0 can easily result in no-queuing (on average),
because the veth NAPI consumer is faster than the producer.  You will
likely see BQL inflight=1 and sink reported avg latency very low
(remember it is okay that the sink gets a high latency penalty as long as
ping latency remains low, as that shows AQM is working).
I ran the benchmarks with --hist and I see what you mean.
I have very similar results.

Is Jonas's way [1] of modifying pktgen maybe the best option to ensure
that the producer is faster than the consumer?

[1] Link: https://lore.kernel.org/netdev/e8cdba04-aa9a-45c6-9807-8274b62920df@tu-dortmund.de/ (local)
Hi, so what I found is that pktgen does not respect
__QUEUE_STATE_STACK_XOFF. So the test data above is invalid, since it
kept sending packets even when BQL had "stopped" the queue. So I patched
pktgen with the following:

-       if (unlikely(netif_xmit_frozen_or_drv_stopped(txq))) {
+       if (unlikely(netif_xmit_frozen_or_stopped(txq))) {
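
For context, the one-line change matters because the two helpers test
different state bits: the driver-only check ignores the stack-level XOFF
bit that BQL sets, while the combined check honours it. A minimal
userspace sketch of that difference, under stated assumptions: the bit
names mirror include/linux/netdevice.h, but the bit values and the helper
names frozen_or_drv_stopped() and frozen_or_stopped() are hypothetical
stand-ins, and this is not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of the tx-queue state bits. Names follow the kernel's
 * netdev queue state, the numeric values are illustrative only. */
enum {
	QUEUE_STATE_DRV_XOFF   = 1u << 0, /* driver stopped the queue */
	QUEUE_STATE_STACK_XOFF = 1u << 1, /* stack (BQL) stopped the queue */
	QUEUE_STATE_FROZEN     = 1u << 2, /* queue is frozen */
};

static_assert((QUEUE_STATE_DRV_XOFF & QUEUE_STATE_STACK_XOFF) == 0,
	      "driver and stack XOFF must be distinct bits");

/* Stand-in for netif_xmit_frozen_or_drv_stopped(): STACK_XOFF is ignored. */
static inline bool frozen_or_drv_stopped(unsigned int state)
{
	return (state & (QUEUE_STATE_DRV_XOFF | QUEUE_STATE_FROZEN)) != 0;
}

/* Stand-in for netif_xmit_frozen_or_stopped(): STACK_XOFF is honoured too. */
static inline bool frozen_or_stopped(unsigned int state)
{
	return (state & (QUEUE_STATE_DRV_XOFF | QUEUE_STATE_STACK_XOFF |
			 QUEUE_STATE_FROZEN)) != 0;
}
```

With only QUEUE_STATE_STACK_XOFF set, frozen_or_drv_stopped() returns
false and frozen_or_stopped() returns true, which is consistent with the
unpatched pktgen continuing to transmit while BQL had stopped the queue.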

After thinking more about the implementation I see possible issues:

1. netdev_tx_completed_queue() never reports more than burst=64 packets:

BQL only increments the limit if the queue was starved. That means:
"The queue was over-limit in the last interval (the last time completion
processing ran), and there is no more data in the queue (i.e. it’s
empty)" [2]
But as at most 64 packets are reported per call, the queue can only grow
when it is <= 64 packets. It can then only stay at a limit >64 until the
next decrease of the limit.


2. netdev_tx_completed_queue() is called at irregular intervals:

If the consumer is slow, it is called approximately every tx_coal_usecs.
But if the consumer is fast, it is called far more frequently, probably
at irregular intervals depending on the scheduling.
However, "BQL depends on periodic completion interrupts" [2].

--> How about adding something like an interrupt that triggers every
    10us and calls netdev_tx_completed_queue() with n_bql collected from
    (multiple) veth_xdp_rcv runs? That could solve 1. and 2. 
Hi,

I worked on a new version (see attachment) that addresses both issues.

The major change is that instead of tracking the timestamp and packet
count as local variables in veth_xdp_rcv(), they are now stored
persistently in veth_rq as struct veth_bql_state. This allows completions
to accumulate across multiple NAPI poll calls, so
netdev_tx_completed_queue() can report more than 64 packets at once
(see point 1). To get the time I am using (the fast) sched_clock() with
a trick to avoid issues when switching between CPUs.

For point 2, the coalescing deadline is now checked both before the
receive loop (to flush completions that timed out since the previous
poll) and after each consumed packet, making completion intervals more
regular. The intervals can still be smaller than
VETH_BQL_COAL_TX_USECS, but I guess this is fine.
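
The accumulate-then-flush scheme described above can be sketched in
userspace C roughly as follows. This is not the attached patch:
struct bql_coal_state, bql_coal_flush(), bql_coal_consume() and COAL_NS
are hypothetical names, the caller passes in the clock value instead of
reading sched_clock(), only packets (not bytes) are tracked, and arming
the deadline on the first pending packet is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for per-queue state that survives across polls. */
struct bql_coal_state {
	uint64_t deadline_ns;  /* time at which pending completions are due */
	unsigned int pending;  /* packets consumed but not yet reported */
};

#define COAL_NS (100 * 1000ULL) /* 100 us coalescing window (assumed) */

/* Flush if the deadline has passed. Returns the number of packets to
 * report, or 0 if nothing is due. In a driver, a nonzero result is where
 * netdev_tx_completed_queue() would be called. */
static inline unsigned int bql_coal_flush(struct bql_coal_state *s,
					  uint64_t now_ns)
{
	unsigned int n;

	assert(s);
	if (!s->pending || now_ns < s->deadline_ns)
		return 0;
	n = s->pending;
	s->pending = 0;
	return n;
}

/* Account one consumed packet, arming the deadline if it is the first
 * pending one, then re-check the deadline. */
static inline unsigned int bql_coal_consume(struct bql_coal_state *s,
					    uint64_t now_ns)
{
	assert(s);
	if (!s->pending)
		s->deadline_ns = now_ns + COAL_NS;
	s->pending++;
	return bql_coal_flush(s, now_ns);
}
```

In this model a poll would call bql_coal_flush() once before its receive
loop and bql_coal_consume() for each packet. Because pending is kept
between polls, a single flush can report more than one poll's budget of
64 packets.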

I also found out that the BQL limit correlates closely with
VETH_BQL_COAL_TX_USECS. It essentially reflects the latency we are
targeting. I raised the default to 100 µs to allow DQL to converge to a
higher limit (for reaching 255 in the testing below).

With the patched pktgen (respecting __QUEUE_STATE_STACK_XOFF), testing
shows:
- --nrules 0: DQL limit reaches (up to) ~255
- --nrules 10000: DQL limit converges to ~0 (with --gro-disable)

These results are what I would expect from a BQL algorithm, but more
testing is needed of course.

What do you think?

Thanks!

BTW: I think that this implementation could also work for other
     software interfaces.
[2] Link: https://medium.com/@tom_84912/byte-queue-limits-the-unauthorized-biography-61adc5730b83
There is an important gotcha. We actually have micro-burst of queuing
(likely due to scheduling noise). Reading BQL stats from /sys will show
BQL inflight=1, but when using the option --hist it is visible that
@inflight has a long tail (see the signature below).  The "qdisc" output
line also shows this happening via requeues increasing (approx 17/sec in
a test with 567Kpps). (this was with the time-based BQL impl).
I understand.
