Re: [PATCH net-next v8 4/4] tun/tap & vhost-net: avoid ptr_ring tail-drop... | netdev

Re: [PATCH net-next v8 4/4] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present

From: Simon Schippers <hidden>
Date: 2026-03-26 15:31:07
Also in: kvm, lkml, virtualization

On 3/26/26 03:41, Jason Wang wrote:

On Wed, Mar 25, 2026 at 10:48 PM Simon Schippers
[off-list ref] wrote:

quoted

On 3/24/26 11:14, Simon Schippers wrote:

quoted

On 3/24/26 02:47, Jason Wang wrote:

quoted

On Thu, Mar 12, 2026 at 9:07 PM Simon Schippers
[off-list ref] wrote:

quoted

This commit prevents tail-drop when a qdisc is present and the ptr_ring
becomes full. Once an entry is successfully produced and the ptr_ring
reaches capacity, the netdev queue is stopped instead of dropping
subsequent packets.

If producing an entry fails anyways due to a race, tun_net_xmit returns
NETDEV_TX_BUSY, again avoiding a drop. Such races are expected because
LLTX is enabled and the transmit path operates without the usual locking.

The existing __tun_wake_queue() function wakes the netdev queue. Races
between this wakeup and the queue-stop logic could leave the queue
stopped indefinitely. To prevent this, a memory barrier is enforced
(as discussed in a similar implementation in [1]), followed by a recheck
that wakes the queue if space is already available.

If no qdisc is present, the previous tail-drop behavior is preserved.

I wonder if we need a dedicated TUN flag to enable this. With this new
flag, we can even prevent TUN from using noqueue (not sure if it's
possible or not).

Except of the slight regressions because of this patchset I do not see
a reason for such a flag.

I have never seen that the driver prevents noqueue. For example you can
set noqueue to your ethernet interface and under load you soon get

net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
                                           dev->name);

followed by a -ENETDOWN. And this is not prevented even though it is
clearly not something a user wants.

quoted

Benchmarks:
The benchmarks show a slight regression in raw transmission performance,
though no packets are lost anymore.

The previously introduced threshold to only wake after the queue stopped
and half of the ring was consumed showed to be a descent choice:
Waking the queue whenever a consume made space in the ring strongly
degrades performance for tap, while waking only when the ring is empty
is too late and also hurts throughput for tap & tap+vhost-net.
Other ratios (3/4, 7/8) showed similar results (not shown here), so
1/2 was chosen for the sake of simplicity for both tun/tap and
tun/tap+vhost-net.

Test setup:
AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU threads;
Average over 20 runs @ 100,000,000 packets. SRSO and spectre v2
mitigations disabled.

Note for tap+vhost-net:
XDP drop program active in VM -> ~2.5x faster, slower for tap due to
more syscalls (high utilization of entry_SYSRETQ_unsafe_stack in perf)

+--------------------------+--------------+----------------+----------+
| 1 thread                 | Stock        | Patched with   | diff     |
| sending                  |              | fq_codel qdisc |          |
+------------+-------------+--------------+----------------+----------+
| TAP        | Transmitted | 1.151 Mpps   | 1.139 Mpps     | -1.1%    |
|            +-------------+--------------+----------------+----------+
|            | Lost/s      | 3.606 Mpps   | 0 pps          |          |
+------------+-------------+--------------+----------------+----------+
| TAP        | Transmitted | 3.948 Mpps   | 3.738 Mpps     | -5.3%    |
|            +-------------+--------------+----------------+----------+
| +vhost-net | Lost/s      | 496.5 Kpps   | 0 pps          |          |
+------------+-------------+--------------+----------------+----------+

+--------------------------+--------------+----------------+----------+
| 2 threads                | Stock        | Patched with   | diff     |
| sending                  |              | fq_codel qdisc |          |
+------------+-------------+--------------+----------------+----------+
| TAP        | Transmitted | 1.133 Mpps   | 1.109 Mpps     | -2.1%    |
|            +-------------+--------------+----------------+----------+
|            | Lost/s      | 8.269 Mpps   | 0 pps          |          |
+------------+-------------+--------------+----------------+----------+
| TAP        | Transmitted | 3.820 Mpps   | 3.513 Mpps     | -8.0%    |
|            +-------------+--------------+----------------+----------+
| +vhost-net | Lost/s      | 4.961 Mpps   | 0 pps          |          |
+------------+-------------+--------------+----------------+----------+

[1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/ (local)

Co-developed-by: Tim Gebauer <redacted>
Signed-off-by: Tim Gebauer <redacted>
Signed-off-by: Simon Schippers <redacted>
---
 drivers/net/tun.c | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index b86582cc6cb6..9b7daec69acd 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c

@@ -1011,6 +1011,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
        struct netdev_queue *queue;
        struct tun_file *tfile;
        int len = skb->len;
+       bool qdisc_present;
+       int ret;

        rcu_read_lock();
        tfile = rcu_dereference(tun->tfiles[txq]);

@@ -1063,13 +1065,37 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)

        nf_reset_ct(skb);

-       if (ptr_ring_produce(&tfile->tx_ring, skb)) {
+       queue = netdev_get_tx_queue(dev, txq);
+       qdisc_present = !qdisc_txq_has_no_queue(queue);
+
+       spin_lock(&tfile->tx_ring.producer_lock);
+       ret = __ptr_ring_produce(&tfile->tx_ring, skb);
+       if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {

So, it's possible that the administrator is switching between noqueue
and another qdisc. So ptr_ring_produce() can fail here, do we need to
check that or not?

Do you mean that? My thoughts:

Switching from noqueue to some qdisc can cause a

net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
                                           dev->name);

followed by a return of -ENETDOWN in __dev_queue_xmit().
This is because tun_net_xmit detects some qdisc with

qdisc_present = !qdisc_txq_has_no_queue(queue);

and returns NETDEV_TX_BUSY even though __dev_queue_xmit() did still
detect noqueue.

I am not sure how to solve this/if this has to be solved.
I do not see a proper way to avoid parallel execution of ndo_start_xmit
and a qdisc change (dev_graft_qdisc only takes qdisc_skb_head lock).

And from my understanding the veth implementation faces the same issue.

How about rechecking if a qdisc is connected?
This would avoid -ENETDOWN.

diff --git a/net/core/dev.c b/net/core/dev.c
index f48dc299e4b2..2731a1a70732 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c

@@ -4845,10 +4845,17 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
                                if (is_list)
                                        rc = NETDEV_TX_OK;
                        }
+                       bool qdisc_present = !qdisc_txq_has_no_queue(txq);
                        HARD_TX_UNLOCK(dev, txq);
                        if (!skb) /* xmit completed */
                                goto out;

+                       /* Maybe a qdisc was connected in the meantime */
+                       if (qdisc_present) {
+                               kfree_skb(skb);
+                               goto out;
+                       }
+
                        net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
                                             dev->name);
                        /* NETDEV_TX_BUSY or queue was stopped */

Probably not, and we likely won't hit this warning because qdisc could
not be changed during ndo_start_xmit().

Okay.

I meant something like this:

1) set noqueue to tuntap
2) produce packets so tuntap is full
3) set e.g fq_codel to tuntap
4) then we can hit the failure of __ptr_ring_produce()

Rethink of the code, it looks just fine.

Yes, in this case it just returns NETDEV_TX_BUSY which is fine with a
qdisc attached.

quoted


Switching from some qdisc to noqueue is no problem I think.

quoted

+               netif_tx_stop_queue(queue);
+               /* Avoid races with queue wake-ups in __tun_wake_queue by
+                * waking if space is available in a re-check.
+                * The barrier makes sure that the stop is visible before
+                * we re-check.
+                */
+               smp_mb__after_atomic();

Let's document which barrier is paired with this.

I am basically copying the (old) logic of veth [1] proposed by
Jakub Kicinski. I must admit I am not 100% sure what it pairs with.

[1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/ (local)

So it looks like it implicitly tries to pair with tun_ring_consume():

1) spinlock(consumer_lock)
2) store NULL to ptr_ring // STORE
3) spinunlock(consumer_lock) // RELEASE
4) spinlock(consumer_lock) // ACQURE
5) check empty
6) spinunlock(consumer_lock)
7) netif_wakeup_queue() // test_and_set() which is an RMW

RELEASE + ACQUIRE implies a full barrier

Thanks.

I see several problems

1) Due to batch consumption, we may get spurious wakeups under heavy
load (we can try disabling batch consuming to see if it helps).

I assume that you mean the waking in the recheck of the producer happens
too often and then wakes too often. But this would just take slightly
more producer cpu as the SOFTIRQ runs on the producer cpu and not slow
down the consumer?

Why would disabling batch consume help here?
Wouldn't it just decrease the consumer speed?

Apart from that I do not see a different method to do this recheck.
The ring producer is only safely able to do a !produce_peek (so a check
for !full).

The normal waking (after consuming half of the ring) should be fine IMO.

2) So the barriers don't help but would slow down the consuming
3) Two spinlocks were used instead of one, this is another reason we
will see a performance regression

You are right, I can change it to a single spin_lock. Apart from that
I do not see how the barriers/locking could be reduced further.

4) Tricky code that needs to be understood or at least requires a comment tweak.

Note that due to ~IFF_TX_SKB_SHARING, pktgen can't clone skbs, so we
may not notice the real degradation.

So run pktgen with pg_set SHARED? I am pretty sure that the vhost
thread was always at 100% CPU so pktgen was not the bottleneck. And when
I had perf enabled I always saw that in my patched version not the
creation of SKB's took most CPU in pktgen but a different unnamed
function (I assume this is a waiting function).


Thank you!

quoted

+               if (!__ptr_ring_produce_peek(&tfile->tx_ring))
+                       netif_tx_wake_queue(queue);
+       }
+       spin_unlock(&tfile->tx_ring.producer_lock);
+
+       if (ret) {
+               /* If a qdisc is attached to our virtual device,
+                * returning NETDEV_TX_BUSY is allowed.
+                */
+               if (qdisc_present) {
+                       rcu_read_unlock();
+                       return NETDEV_TX_BUSY;
+               }
                drop_reason = SKB_DROP_REASON_FULL_RING;
                goto drop;
        }

        /* dev->lltx requires to do our own update of trans_start */
-       queue = netdev_get_tx_queue(dev, txq);
        txq_trans_cond_update(queue);

        /* Notify and wake up reader process */
--
2.43.0

Thanks

Thanks

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help