[PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
From: Simon Schippers <hidden>
Date: 2026-02-04 15:44:09
Also in:
kvm, lkml, virtualization
On 2/3/26 04:48, Jason Wang wrote:
On Mon, Feb 2, 2026 at 4:19 AM Simon Schippers [off-list ref] wrote:
On 1/30/26 02:51, Jason Wang wrote:
On Thu, Jan 29, 2026 at 5:25 PM Simon Schippers [off-list ref] wrote:
On 1/29/26 02:14, Jason Wang wrote:
On Wed, Jan 28, 2026 at 3:54 PM Simon Schippers [off-list ref] wrote:
On 1/28/26 08:03, Jason Wang wrote:
On Wed, Jan 28, 2026 at 12:48 AM Simon Schippers [off-list ref] wrote:
On 1/23/26 10:54, Simon Schippers wrote:
On 1/23/26 04:05, Jason Wang wrote:
On Thu, Jan 22, 2026 at 1:35 PM Jason Wang [off-list ref] wrote:
On Wed, Jan 21, 2026 at 5:33 PM Simon Schippers [off-list ref] wrote:
On 1/9/26 07:02, Jason Wang wrote:
On Thu, Jan 8, 2026 at 3:41 PM Simon Schippers [off-list ref] wrote:
On 1/8/26 04:38, Jason Wang wrote:
On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers [off-list ref] wrote:
Introduce {tun,tap}_ring_consume() helpers that wrap __ptr_ring_consume()
and wake the corresponding netdev subqueue when consuming an entry frees
space in the underlying ptr_ring.

Stopping of the netdev queue when the ptr_ring is full will be introduced
in an upcoming commit.

Co-developed-by: Tim Gebauer <redacted>
Signed-off-by: Tim Gebauer <redacted>
Signed-off-by: Simon Schippers <redacted>
---
 drivers/net/tap.c | 23 ++++++++++++++++++++++-
 drivers/net/tun.c | 25 +++++++++++++++++++++++--
 2 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 1197f245e873..2442cf7ac385 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -753,6 +753,27 @@ static ssize_t tap_put_user(struct tap_queue *q,
 	return ret ? ret : total;
 }
 
+static void *tap_ring_consume(struct tap_queue *q)
+{
+	struct ptr_ring *ring = &q->ring;
+	struct net_device *dev;
+	void *ptr;
+
+	spin_lock(&ring->consumer_lock);
+
+	ptr = __ptr_ring_consume(ring);
+	if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
+		rcu_read_lock();
+		dev = rcu_dereference(q->tap)->dev;
+		netif_wake_subqueue(dev, q->queue_index);
+		rcu_read_unlock();
+	}
+
+	spin_unlock(&ring->consumer_lock);
+
+	return ptr;
+}
+
 static ssize_t tap_do_read(struct tap_queue *q,
 			   struct iov_iter *to,
 			   int noblock, struct sk_buff *skb)
@@ -774,7 +795,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
 					TASK_INTERRUPTIBLE);
 
 		/* Read frames from the queue */
-		skb = ptr_ring_consume(&q->ring);
+		skb = tap_ring_consume(q);
 		if (skb)
 			break;
 		if (noblock) {
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 8192740357a0..7148f9a844a4 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2113,13 +2113,34 @@ static ssize_t tun_put_user(struct tun_struct *tun,
 	return total;
 }
 
+static void *tun_ring_consume(struct tun_file *tfile)
+{
+	struct ptr_ring *ring = &tfile->tx_ring;
+	struct net_device *dev;
+	void *ptr;
+
+	spin_lock(&ring->consumer_lock);
+
+	ptr = __ptr_ring_consume(ring);
+	if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {

I guess it's the "bug" I mentioned in the previous patch that leads to the
check of __ptr_ring_consume_created_space() here. If it's true, another
call to tweak the current API.

+		rcu_read_lock();
+		dev = rcu_dereference(tfile->tun)->dev;
+		netif_wake_subqueue(dev, tfile->queue_index);

This would cause the producer TX_SOFTIRQ to run on the same cpu which I'm
not sure is what we want.

What else would you suggest calling to wake the queue?

I don't have a good method in my mind, just want to point out its
implications.

I have to admit I'm a bit stuck at this point, particularly with this
aspect. What is the correct way to pass the producer CPU ID to the
consumer? Would it make sense to store smp_processor_id() in the tfile
inside tun_net_xmit(), or should it instead be stored in the skb (similar
to the XDP bit)? In the latter case, my concern is that this information
may already be significantly outdated by the time it is used.

Based on that, my idea would be for the consumer to wake the producer by
invoking a new function (e.g., tun_wake_queue()) on the producer CPU via
smp_call_function_single(). Is this a reasonable approach?

I'm not sure but it would introduce costs like IPI.
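For readers following the consume-and-wake discussion above, here is a minimal user-space sketch of the pattern in C. It is not the kernel code: `toy_ring`, `toy_ring_produce()` and `toy_ring_consume()` are hypothetical names, there is no locking or RCU, and the `queue_stopped` flag merely stands in for a stopped netdev subqueue. It also assumes every consume frees a slot immediately, which is a simplification; the patch checks __ptr_ring_consume_created_space() instead, presumably because the real ptr_ring does not necessarily make space visible to the producer on each consume.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_RING_SIZE 4

/* Hypothetical user-space stand-in for a ptr_ring plus its netdev subqueue. */
struct toy_ring {
	void *slots[TOY_RING_SIZE];	/* NULL marks an empty slot */
	size_t producer;		/* next slot to fill */
	size_t consumer;		/* next slot to drain */
	bool queue_stopped;		/* models a stopped subqueue */
	unsigned int wakeups;		/* how often the consumer woke it */
};

/* Producer side: returns false and stops the queue when the ring is full. */
static bool toy_ring_produce(struct toy_ring *r, void *ptr)
{
	assert(ptr != NULL);		/* NULL is reserved for "empty" */

	if (r->slots[r->producer] != NULL) {
		r->queue_stopped = true;
		return false;
	}
	r->slots[r->producer] = ptr;
	r->producer = (r->producer + 1) % TOY_RING_SIZE;
	return true;
}

/* Consumer side: take one entry and wake the queue if it was stopped. */
static void *toy_ring_consume(struct toy_ring *r)
{
	void *ptr = r->slots[r->consumer];

	if (ptr == NULL)
		return NULL;		/* ring is empty */

	r->slots[r->consumer] = NULL;	/* slot becomes free right away */
	r->consumer = (r->consumer + 1) % TOY_RING_SIZE;

	if (r->queue_stopped) {
		/* In this toy model the wakeup runs in the consumer's context. */
		r->queue_stopped = false;
		r->wakeups++;
	}
	return ptr;
}
```

In this model the wakeup is performed by whoever calls toy_ring_consume(), which mirrors the concern raised above that the wake happens on the consumer's CPU rather than the producer's.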
More generally, would triggering TX_SOFTIRQ on the consumer CPU be
considered a deal-breaker for the patch set?

It depends on whether or not it has effects on the performance.
Especially when vhost is pinned.

I meant we can benchmark to see the impact. For example, pin vhost to a
specific CPU and then try to see the impact of the TX_SOFTIRQ.

Thanks

I ran benchmarks with vhost pinned to CPU 0 using taskset -p -c 0 ... for
both the stock and patched versions. The benchmarks were run with the
full patch series applied, since testing only patches 1-3 would not be
meaningful - the queue is never stopped in that case, so no TX_SOFTIRQ is
triggered.

Compared to the non-pinned CPU benchmarks in the cover letter,
performance is lower for pktgen with a single thread but higher with
four threads. The results show no regression for the patched version,
with even slight performance improvements observed:

+-------------------------+-----------+----------------+
| pktgen benchmarks to    | Stock     | Patched with   |
| Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
| 100M packets            |           |                |
| vhost pinned to core 0  |           |                |
+-----------+-------------+-----------+----------------+
| TAP       | Transmitted | 452 Kpps  | 454 Kpps       |
|  +        +-------------+-----------+----------------+
| vhost-net | Lost        | 1154 Kpps | 0              |
+-----------+-------------+-----------+----------------+

+-------------------------+-----------+----------------+
| pktgen benchmarks to    | Stock     | Patched with   |
| Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
| 100M packets            |           |                |
| vhost pinned to core 0  |           |                |
| *4 threads*             |           |                |
+-----------+-------------+-----------+----------------+
| TAP       | Transmitted | 71 Kpps   | 79 Kpps        |
|  +        +-------------+-----------+----------------+
| vhost-net | Lost        | 1527 Kpps | 0              |
+-----------+-------------+-----------+----------------+

The PPS seems to be low. I'd suggest using testpmd (rxonly) mode in the
guest or an xdp program that did XDP_DROP in the guest.

I forgot to mention that these PPS values are per thread.
So overall we have 71 Kpps * 4 = 284 Kpps and 79 Kpps * 4 = 326 Kpps,
respectively. For packet loss, that comes out to 1154 Kpps * 4 =
4616 Kpps and 0, respectively.

Sorry about that!

The pktgen benchmarks with a single thread look fine, right?

Still looks very low. E.g I just have a run of pktgen (using
pktgen_sample03_burst_single_flow.sh) without a XDP_DROP in the guest, I
can get 1Mpps.

Keep in mind that I am using an older CPU (i5-6300HQ). For the
single-threaded tests I always used pktgen_sample01_simple.sh, and for
the multi-threaded tests I always used pktgen_sample02_multiqueue.sh.

Using pktgen_sample03_burst_single_flow.sh as you did fails for me (even
though the same parameters work fine for sample01 and sample02):

samples/pktgen/pktgen_sample03_burst_single_flow.sh -i tap0 -m
52:54:00:12:34:56 -d 10.0.0.2 -n 100000000
/samples/pktgen/functions.sh: line 79: echo: write error: Operation not
supported
ERROR: Write error(1) occurred
cmd: "burst 32 > /proc/net/pktgen/tap0@0"

...and I do not know what I am doing wrong, even after looking at
Documentation/networking/pktgen.rst. Every burst size except 1 fails.
Any clues?

Please use -b 0, and I'm Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz.

I tried using "-b 0", and while it worked, there was no noticeable
performance improvement.
Another thing I can think of is to disable

1) mitigations in both guest and host
2) any kernel debug features in both host and guest

I also rebuilt the kernel with everything disabled under
"Kernel hacking", but that didn’t make any difference either.

Because of this, I ran "pktgen_sample01_simple.sh" and
"pktgen_sample02_multiqueue.sh" on my AMD Ryzen 5 5600X system. The
results were about 374 Kpps with TAP and 1192 Kpps with TAP+vhost_net,
with very similar performance between the stock and patched kernels.

Personally, I think the low performance is to blame on the hardware.

Let's double confirm this by:

1) make sure pktgen is using 100% CPU
2) Perf doesn't show anything strange for pktgen thread

Thanks
I ran pktgen using pktgen_sample01_simple.sh and, in parallel, started a
100 second perf stat measurement covering all kpktgend threads.
Across all configurations, a single CPU was fully utilized.
Apart from that, the patched variants show a higher branch frequency and
a slightly increased number of context switches.
The detailed results are provided below:
Processor: Ryzen 5 5600X
pktgen command:
sudo perf stat samples/pktgen/pktgen_sample01_simple.sh -i tap0 -m
52:54:00:12:34:56 -d 10.0.0.2 -n 10000000000
perf stat command:
sudo perf stat --timeout 100000 -p $(pgrep kpktgend | tr '\n' ,) -o X.txt
Results:
Stock TAP:
46.997 context-switches # 467,2 cs/sec cs_per_second
0 cpu-migrations # 0,0 migrations/sec migrations_per_second
0 page-faults # 0,0 faults/sec page_faults_per_second
100.587,69 msec task-clock # 1,0 CPUs CPUs_utilized
8.491.586.483 branch-misses # 10,9 % branch_miss_rate (50,24%)
77.734.761.406 branches # 772,8 M/sec branch_frequency (66,85%)
382.420.291.585 cpu-cycles # 3,8 GHz cycles_frequency (66,85%)
377.612.185.141 instructions # 1,0 instructions insn_per_cycle (66,85%)
84.012.185.936 stalled-cycles-frontend # 0,22 frontend_cycles_idle (66,35%)
100,100414494 seconds time elapsed
Stock TAP+vhost-net:
47.087 context-switches # 468,1 cs/sec cs_per_second
0 cpu-migrations # 0,0 migrations/sec migrations_per_second
0 page-faults # 0,0 faults/sec page_faults_per_second
100.594,09 msec task-clock # 1,0 CPUs CPUs_utilized
8.034.703.613 branch-misses # 11,1 % branch_miss_rate (50,24%)
72.477.989.922 branches # 720,5 M/sec branch_frequency (66,86%)
382.218.276.832 cpu-cycles # 3,8 GHz cycles_frequency (66,85%)
349.555.577.281 instructions # 0,9 instructions insn_per_cycle (66,85%)
83.917.644.262 stalled-cycles-frontend # 0,22 frontend_cycles_idle (66,35%)
100,100520402 seconds time elapsed
Patched TAP:
47.862 context-switches # 475,8 cs/sec cs_per_second
0 cpu-migrations # 0,0 migrations/sec migrations_per_second
0 page-faults # 0,0 faults/sec page_faults_per_second
100.589,30 msec task-clock # 1,0 CPUs CPUs_utilized
9.337.258.794 branch-misses # 9,4 % branch_miss_rate (50,19%)
99.518.421.676 branches # 989,4 M/sec branch_frequency (66,85%)
382.508.244.894 cpu-cycles # 3,8 GHz cycles_frequency (66,85%)
312.582.270.975 instructions # 0,8 instructions insn_per_cycle (66,85%)
76.338.503.984 stalled-cycles-frontend # 0,20 frontend_cycles_idle (66,39%)
100,101262454 seconds time elapsed
Patched TAP+vhost-net:
47.892 context-switches # 476,1 cs/sec cs_per_second
0 cpu-migrations # 0,0 migrations/sec migrations_per_second
0 page-faults # 0,0 faults/sec page_faults_per_second
100.581,95 msec task-clock # 1,0 CPUs CPUs_utilized
9.083.588.313 branch-misses # 10,1 % branch_miss_rate (50,28%)
90.300.124.712 branches # 897,8 M/sec branch_frequency (66,85%)
382.374.510.376 cpu-cycles # 3,8 GHz cycles_frequency (66,85%)
340.089.181.199 instructions # 0,9 instructions insn_per_cycle (66,85%)
78.151.408.955 stalled-cycles-frontend # 0,20 frontend_cycles_idle (66,31%)
100,101212911 seconds time elapsed
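
As a cross-check of the derived columns in the perf stat output above, the following C sketch recomputes instructions per cycle, branch-miss rate and branch frequency from the raw counters of the two TAP runs. The struct and function names are illustrative only, and the counters are perf's own scaled estimates (note the percentages in parentheses), so the results are approximate.

```c
#include <assert.h>

/* Raw counters copied from one perf stat run (illustrative container). */
struct perf_sample {
	double task_clock_msec;
	double branch_misses;
	double branches;
	double cpu_cycles;
	double instructions;
};

/* Instructions retired per CPU cycle. */
static double insn_per_cycle(const struct perf_sample *s)
{
	assert(s->cpu_cycles > 0.0);
	return s->instructions / s->cpu_cycles;
}

/* Share of branches that were mispredicted, in percent. */
static double branch_miss_pct(const struct perf_sample *s)
{
	assert(s->branches > 0.0);
	return 100.0 * s->branch_misses / s->branches;
}

/* Branches executed per second of task-clock time. */
static double branches_per_sec(const struct perf_sample *s)
{
	assert(s->task_clock_msec > 0.0);
	return s->branches / (s->task_clock_msec / 1000.0);
}

/* "Stock TAP" counters from the results above. */
static const struct perf_sample stock_tap = {
	.task_clock_msec = 100587.69,
	.branch_misses   = 8491586483.0,
	.branches        = 77734761406.0,
	.cpu_cycles      = 382420291585.0,
	.instructions    = 377612185141.0,
};

/* "Patched TAP" counters from the results above. */
static const struct perf_sample patched_tap = {
	.task_clock_msec = 100589.30,
	.branch_misses   = 9337258794.0,
	.branches        = 99518421676.0,
	.cpu_cycles      = 382508244894.0,
	.instructions    = 312582270975.0,
};
```

With these inputs the helpers reproduce the figures perf printed (about 772.8 M/sec versus 989.4 M/sec branch frequency for stock versus patched TAP), and the ratio of the two branch counts works out to roughly 1.28, i.e. about 28% more branches in the patched TAP run over the same 100 seconds.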
Thanks!

Thanks

Thanks!

I'll still look into using an XDP program that does XDP_DROP in the guest.

Thanks!

Thanks
+------------------------+-------------+----------------+
| iperf3 TCP benchmarks  | Stock       | Patched with   |
| to Debian VM 120s      |             | fq_codel qdisc |
| vhost pinned to core 0 |             |                |
+------------------------+-------------+----------------+
| TAP                    | 22.0 Gbit/s | 22.0 Gbit/s    |
|  +                     |             |                |
| vhost-net              |             |                |
+------------------------+-------------+----------------+

+---------------------------+-------------+----------------+
| iperf3 TCP benchmarks     | Stock       | Patched with   |
| to Debian VM 120s         |             | fq_codel qdisc |
| vhost pinned to core 0    |             |                |
| *4 iperf3 client threads* |             |                |
+---------------------------+-------------+----------------+
| TAP                       | 21.4 Gbit/s | 21.5 Gbit/s    |
|  +                        |             |                |
| vhost-net                 |             |                |
+---------------------------+-------------+----------------+

What are your thoughts on this?

Thanks!

Thanks