Re: TCP sender stuck despite receiving ACKs from the peer

From: Eric Dumazet <edumazet@google.com>
Date: 2025-10-31 09:07:02

On Thu, Oct 23, 2025 at 10:57 PM Eric Dumazet [off-list ref] wrote:

On Thu, Oct 23, 2025 at 10:29 PM Eric Dumazet [off-list ref] wrote:

On Thu, Oct 23, 2025 at 3:52 PM Christoph Schwarz [off-list ref] wrote:

On 10/3/25 18:24, Neal Cardwell wrote:
[...]

Thanks for the report!

A few thoughts:

[...]

(2) After that, would it be possible to try this test with a newer
kernel? You mentioned this is with kernel version 5.10.165, but that's
more than 2.5 years old at this point, and it's possible the bug has
been fixed since then.  Could you please try this test with the newest
kernel that is available in your distribution? (If you are forced to
use 5.10.x on your distribution, note that even with 5.10.x there is
v5.10.245, which was released yesterday.)

(3) If this bug is still reproducible with a recent kernel, would it
be possible to gather .pcap traces from both client and server,
including SYN and SYN/ACK? Sometimes it can be helpful to see the
perspective of both ends, especially if there are middleboxes
manipulating the packets in some way.

Thanks!

Best regards,
neal

Hi,

I want to give an update as we made some progress.

We tried with the 6.12.40 kernel, but it was much harder to reproduce
and we were not able to do a successful packet capture and reproduction
at the same time. So we went back to 5.10.165, added more tracing and
eventually figured out how the TCP connection got into the bad state.

This is a backtrace from the TCP stack calling down to the device driver:
  => fdev_tx    // ndo_start_xmit hook of a proprietary device driver
  => dev_hard_start_xmit
  => sch_direct_xmit
  => __qdisc_run
  => __dev_queue_xmit
  => vlan_dev_hard_start_xmit
  => dev_hard_start_xmit
  => __dev_queue_xmit
  => ip_finish_output2
  => __ip_queue_xmit
  => __tcp_transmit_skb
  => tcp_write_xmit

tcp_write_xmit sends segments of 65160 bytes. Due to an MSS of 1448,
they get broken down into 45 packets of 1448 bytes each.
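The packet count above follows from a round-up division of the segment length by the MSS. A minimal sketch of that arithmetic; the helper name `packets_for_segment` is hypothetical and not a kernel function:

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel API): number of MSS-sized packets
 * needed to carry seg_len bytes, rounding up for a partial last packet.
 */
static uint32_t packets_for_segment(uint32_t seg_len, uint32_t mss)
{
    if (mss == 0)
        return 0; /* avoid dividing by zero on a bogus MSS */

    return seg_len / mss + (seg_len % mss != 0);
}
```

For the values reported here, `packets_for_segment(65160, 1448)` gives 45, with no partial last packet since 45 * 1448 = 65160.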

So the driver does not support TSO? Quite odd in 2025...

One thing you want is to make sure your vlan device (the one without a
Qdisc on it) advertises TSO support.

ethtool -k vlan0

quoted

These 45
packets eventually reach dev_hard_start_xmit, which is a simple loop
forwarding packets one by one. When the problem occurs, we see that
dev_hard_start_xmit transmits the initial N packets successfully, but
the remaining 45-N ones fail with error code 1. The loop runs to
completion and does not break.

The error code 1 from dev_hard_start_xmit gets returned through the call
stack up to tcp_write_xmit, which treats this as error and breaks its
own loop without advancing snd_nxt:

                if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
                        break; // <<< breaks here

repair:
                /* Advance the send_head.  This one is sent out.
                 * This call will increment packets_out.
                 */
                tcp_event_new_data_sent(sk, skb);

From packet captures we can prove that the 45 packets show up on the
kernel device on the sender. In addition, the first N of those 45
packets show up on the kernel device on the peer. The connection is now
in the problem state where the peer is N packets ahead of the sender and
the sender thinks that it never sent those packets, leading to the problem
as described in my initial mail.
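The sequence described above can be sketched as a small user-space model. This is not kernel code: every `model_*` name, the status values, and the buffer accounting are assumptions made up for illustration. It only mirrors the reported behaviour, where each packet is attempted, the loop does not break, the caller sees the status of the last attempt, and the sender advances `snd_nxt` only on success.

```c
#include <stddef.h>

/* Hypothetical status codes for this model; not the kernel's definitions. */
enum { MODEL_TX_OK = 0, MODEL_TX_DROP = 1 };

struct model_txq {
    int free_slots;        /* TX buffers the modelled driver still has */
};

struct model_sender {
    unsigned int snd_nxt;  /* next sequence number the sender will use */
};

/* Models one per-packet transmit call: succeeds while a buffer is free. */
static int model_xmit_one(struct model_txq *q)
{
    if (q->free_slots <= 0)
        return MODEL_TX_DROP;
    q->free_slots--;
    return MODEL_TX_OK;
}

/*
 * Models the per-packet loop: every packet is attempted, the loop never
 * breaks, and the status of the last attempt is what the caller gets.
 */
static int model_xmit_list(struct model_txq *q, int npackets, int *sent)
{
    int rc = MODEL_TX_OK;

    *sent = 0;
    for (int i = 0; i < npackets; i++) {
        rc = model_xmit_one(q);
        if (rc == MODEL_TX_OK)
            (*sent)++;
    }
    return rc;
}

/*
 * Models the sender: any non-zero status is read as "nothing was sent",
 * so snd_nxt stays put. Returns the number of bytes that reached the
 * wire without being accounted for by the sender.
 */
static unsigned int model_send_segment(struct model_sender *s,
                                       struct model_txq *q,
                                       int npackets, unsigned int mss)
{
    int sent;
    int rc = model_xmit_list(q, npackets, &sent);

    if (rc == MODEL_TX_OK) {
        s->snd_nxt += (unsigned int)npackets * mss;
        return 0;
    }
    return (unsigned int)sent * mss;
}
```

With 9 free buffers, 45 packets and an MSS of 1448 (values chosen for illustration), `model_send_segment` leaves `snd_nxt` untouched and returns 9 * 1448 = 13032 bytes that the peer may have received but the sender believes were never sent.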

Furthermore, we noticed that the 45-N missing packets show up as drops
on the sender's kernel device:

vlan0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
         inet 127.2.0.1  netmask 255.255.255.0  broadcast 0.0.0.0
         [...]
         TX errors 0  dropped 36 overruns 0  carrier 0  collisions 0

This device is a vlan device stacked on another device like this:

49: vlan0@parent: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
     link/ether 02:1c:a7:00:00:01 brd ff:ff:ff:ff:ff:ff
3: parent: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 10000 qdisc prio state UNKNOWN mode DEFAULT group default qlen 1000
     link/ether xx:xx:xx:xx:xx:xx brd ff:ff:ff:ff:ff:ff

Eventually packets need to go through the device driver, which has only
a limited number of TX buffers. The driver implements flow control: when
it is about to exhaust its buffers, it stops TX by calling
netif_stop_queue. Once more buffers become available again, it resumes
TX by calling netif_wake_queue. From packet counters we can tell that
this is happening frequently.

At this point we suspected "qdisc noqueue" to be a factor, and indeed,
after adding a queue to vlan0 the problem no longer happened, although
there are still TX drops on the vlan0 device.

Missing queue or not, we think there is a disconnect between the device
driver API and the TCP stack. The device driver API only allows
transmitting packets one by one (ndo_start_xmit). The TCP stack operates
on larger segments that it breaks down into smaller pieces
(tcp_write_xmit / __tcp_transmit_skb). This can lead to a classic "short
write" condition which the network stack doesn't seem to handle well in
all cases.

Appreciate your comments,

Very nice analysis, very much appreciated.

I think the issue here is that __tcp_transmit_skb() trusts the return
of icsk->icsk_af_ops->queue_xmit()

An error means: the packet was _not_ sent at all.

Here, it seems that the GSO layer returns an error, even if some
segments were sent.
This needs to be confirmed and fixed, but in the meantime, make sure
vlan0 has TSO support.
It will also be more efficient to segment (if your Ethernet device has
no TSO capability) at the last moment, because all the segments will be
sent in the described scenario thanks to qdisc requeues.

Could you try the following patch?

Thanks again!

diff --git a/net/core/dev.c b/net/core/dev.c
index 378c2d010faf251ffd874ebf0cc3dd6968eee447..8efda845611129920a9ae21d5e9dd05ffab36103

--- a/net/core/dev.c
+++ b/net/core/dev.c

@@ -4796,6 +4796,8 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
                 * to -1 or to their cpu id, but not to our id.
                 */
                if (READ_ONCE(txq->xmit_lock_owner) != cpu) {
+                       struct sk_buff *orig;
+
                        if (dev_xmit_recursion())
                                goto recursion_alert;

@@ -4805,6 +4807,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)

                        HARD_TX_LOCK(dev, txq, cpu);

+                       orig = skb;
                        if (!netif_xmit_stopped(txq)) {
                                dev_xmit_recursion_inc();
                                skb = dev_hard_start_xmit(skb, dev, txq, &rc);

@@ -4817,6 +4820,11 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
                        HARD_TX_UNLOCK(dev, txq);
                        net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
                                             dev->name);
+                       if (skb != orig) {
+                               /* If at least one packet was sent, we must return NETDEV_TX_OK */
+                               rc = NETDEV_TX_OK;
+                               goto unlock;
+                       }
                } else {
                        /* Recursion is detected! It is possible,
                         * unfortunately

@@ -4828,6 +4836,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
        }

        rc = -ENETDOWN;
+unlock:
        rcu_read_unlock_bh();

        dev_core_stats_tx_dropped_inc(dev);
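The idea behind the patch, detecting a partial send by comparing the list head before and after the transmit call, can be sketched outside the kernel as follows. This is a simplified model under stated assumptions, not the kernel implementation: the `model_*` names, the `budget` parameter and the `MODEL_TX_*` codes are hypothetical, and only the `orig` comparison mirrors the diff above.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical status codes for this model; not the kernel's definitions. */
enum { MODEL_TX_OK = 0, MODEL_TX_BUSY = 16 };

struct model_pkt {
    struct model_pkt *next;  /* singly linked list of pending packets */
};

/*
 * Models a transmit call that consumes packets from the head of the list
 * while `budget` allows, and returns the first unsent packet (NULL when
 * everything was sent). *rc reports the status of the last attempt.
 */
static struct model_pkt *model_start_xmit(struct model_pkt *head,
                                          int budget, int *rc)
{
    struct model_pkt *p = head;

    *rc = MODEL_TX_OK;
    while (p) {
        if (budget <= 0) {
            *rc = MODEL_TX_BUSY;
            break;
        }
        budget--;
        p = p->next;
    }
    return p;
}

/*
 * Models the caller with the check from the patch: if the list head moved,
 * at least one packet went out, so success is reported instead of an error.
 */
static int model_queue_xmit(struct model_pkt *head, int budget)
{
    struct model_pkt *orig = head;
    struct model_pkt *rest;
    int rc;

    rest = model_start_xmit(head, budget, &rc);
    if (rc == MODEL_TX_OK)
        return MODEL_TX_OK;
    if (rest != orig)
        return MODEL_TX_OK;  /* partial send: do not report "not sent" */
    return -ENETDOWN;        /* nothing was sent at all */
}
```

In this model, a three-packet list with a budget of 2 reports success even though one packet remains unsent, while a budget of 0 still returns `-ENETDOWN`.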

Hi Christoph

Any progress on your side?

Thanks.
