Re: [PATCH] tcp: check socket state before calling WARN_ON

From: Neal Cardwell <ncardwell@google.com>
Date: 2025-02-25 17:25:05
Also in: lkml
Subsystem: networking [general], networking [tcp], the rest · Maintainers: "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni, Neal Cardwell, Linus Torvalds

On Mon, Feb 24, 2025 at 4:13 PM Neal Cardwell [off-list ref] wrote:

quoted hunk ↗ jump to hunk

On Mon, Feb 3, 2025 at 12:17 AM Youngmin Nam [off-list ref] wrote:

quoted

Hi Neal,
Thank you for looking into this issue.
When we first encountered this issue, we also suspected that tcp_write_queue_purge() was being called.
We can provide any information you would like to inspect.

Thanks again for raising this issue, and providing all that data!

I've come up with a reproducer for this issue, and an explanation for
why this has only been seen on Android so far, and a theory about a
related socket leak issue, and a proposed fix for the WARN and the
socket leak.

Here is the scenario:

+ user process A has a socket in TCP_ESTABLISHED

+ user process A calls close(fd)

+ socket calls __tcp_close() and tcp_close_state() decides to enter
TCP_FIN_WAIT1 and send a FIN

+ FIN is lost and retransmitted, making the state:
---
 tp->packets_out = 1
 tp->sacked_out = 0
 tp->lost_out = 1
 tp->retrans_out = 1
---

+ someone invokes "ss" to --kill the socket using the functionality in
(1e64e298b8 "net: diag: Support destroying TCP sockets")

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c1e64e298b8cad309091b95d8436a0255c84f54a

 (note: this was added for Android, so would not be surprising to have
this inet_diag --kill run on Android)

+ the ss --kill causes a call to tcp_abort()

+ tcp_abort() calls tcp_write_queue_purge()

+ tcp_write_queue_purge() sets packets_out=0 but leaves lost_out=1,
retrans_out=1

+ tcp_sock still exists in TCP_FIN_WAIT1 but now with an inconsistent state

+ ACK arrives and causes a WARN_ON from tcp_verify_left_out():

#define tcp_verify_left_out(tp) WARN_ON(tcp_left_out(tp) > tp->packets_out)

because the state has:

 ---
 tcp_left_out(tp) = sacked_out + lost_out = 1
  tp->packets_out = 0
---

because the state is:

---
 tp->packets_out = 0
 tp->sacked_out = 0
 tp->lost_out = 1
 tp->retrans_out = 1
---

I guess perhaps one fix would be to just have tcp_write_queue_purge()
zero out those other fields:

---
 tp->sacked_out = 0
 tp->lost_out = 0
 tp->retrans_out = 0
---

However, there is a related and worse problem. Because this killed
socket has tp->packets_out, the next time the RTO timer fires,
tcp_retransmit_timer() notices !tp->packets_out is true, so it short
circuits and returns without setting another RTO timer or checking to
see if the socket should be deleted. So the tcp_sock is now sitting in
memory with no timer set to delete it. So we could leak a socket this
way. So AFAICT to fix this socket leak problem, perhaps we want a
patch like the following (not tested yet), so that we delete all
killed sockets immediately, whether they are SOCK_DEAD (orphans for
which the user already called close() or not) :

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 28cf19317b6c2..a266078b8ec8c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c

@@ -5563,15 +5563,12 @@ int tcp_abort(struct sock *sk, int err)
        local_bh_disable();
        bh_lock_sock(sk);

-       if (!sock_flag(sk, SOCK_DEAD)) {
-               if (tcp_need_reset(sk->sk_state))
-                       tcp_send_active_reset(sk, GFP_ATOMIC);
-               tcp_done_with_error(sk, err);
-       }
+       if (tcp_need_reset(sk->sk_state))
+               tcp_send_active_reset(sk, GFP_ATOMIC);
+       tcp_done_with_error(sk, err);

        bh_unlock_sock(sk);
        local_bh_enable();
-       tcp_write_queue_purge(sk);
        release_sock(sk);
        return 0;
 }

---

Actually, it seems like a similar fix was already merged into Linux v6.11:

bac76cf89816b tcp: fix forever orphan socket caused by tcp_abort

Details below.

Youngmin, does your kernel have this bac76cf89816b fix? If not, can
you please cherry-pick this fix and retest?

Thanks!
neal

ps: details for bac76cf89816b:

commit bac76cf89816bff06c4ec2f3df97dc34e150a1c4
Author: Xueming Feng [off-list ref]
Date:   Mon Aug 26 18:23:27 2024 +0800

    tcp: fix forever orphan socket caused by tcp_abort

    We have some problem closing zero-window fin-wait-1 tcp sockets in our
    environment. This patch come from the investigation.

    Previously tcp_abort only sends out reset and calls tcp_done when the
    socket is not SOCK_DEAD, aka orphan. For orphan socket, it will only
    purging the write queue, but not close the socket and left it to the
    timer.

    While purging the write queue, tp->packets_out and sk->sk_write_queue
    is cleared along the way. However tcp_retransmit_timer have early
    return based on !tp->packets_out and tcp_probe_timer have early
    return based on !sk->sk_write_queue.

    This caused ICSK_TIME_RETRANS and ICSK_TIME_PROBE0 not being resched
    and socket not being killed by the timers, converting a zero-windowed
    orphan into a forever orphan.

    This patch removes the SOCK_DEAD check in tcp_abort, making it send
    reset to peer and close the socket accordingly. Preventing the
    timer-less orphan from happening.

    According to Lorenzo's email in the v1 thread, the check was there to
    prevent force-closing the same socket twice. That situation is handled
    by testing for TCP_CLOSE inside lock, and returning -ENOENT if it is
    already closed.

    The -ENOENT code comes from the associate patch Lorenzo made for
    iproute2-ss; link attached below, which also conform to RFC 9293.

    At the end of the patch, tcp_write_queue_purge(sk) is removed because it
    was already called in tcp_done_with_error().

    p.s. This is the same patch with v2. Resent due to mis-labeled "changes
    requested" on patchwork.kernel.org.

    Link: https://patchwork.ozlabs.org/project/netdev/patch/1450773094-7978-3-git-send-email-lorenzo@google.com/
    Fixes: c1e64e298b8c ("net: diag: Support destroying TCP sockets.")
    Signed-off-by: Xueming Feng [off-list ref]
    Tested-by: Lorenzo Colitti [off-list ref]
    Reviewed-by: Jason Xing [off-list ref]
    Reviewed-by: Eric Dumazet [off-list ref]
    Link: https://patch.msgid.link/20240826102327.1461482-1-kuro@kuroa.me
    Signed-off-by: Jakub Kicinski [off-list ref]

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e03a342c9162b..831a18dc7aa6d 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c

@@ -4637,6 +4637,13 @@ int tcp_abort(struct sock *sk, int err)
                /* Don't race with userspace socket closes such as tcp_close. */
                lock_sock(sk);

+       /* Avoid closing the same socket twice. */
+       if (sk->sk_state == TCP_CLOSE) {
+               if (!has_current_bpf_ctx())
+                       release_sock(sk);
+               return -ENOENT;
+       }
+
        if (sk->sk_state == TCP_LISTEN) {
                tcp_set_state(sk, TCP_CLOSE);
                inet_csk_listen_stop(sk);

@@ -4646,16 +4653,13 @@ int tcp_abort(struct sock *sk, int err)
        local_bh_disable();
        bh_lock_sock(sk);

-       if (!sock_flag(sk, SOCK_DEAD)) {
-               if (tcp_need_reset(sk->sk_state))
-                       tcp_send_active_reset(sk, GFP_ATOMIC,
-                                             SK_RST_REASON_NOT_SPECIFIED);
-               tcp_done_with_error(sk, err);
-       }
+       if (tcp_need_reset(sk->sk_state))
+               tcp_send_active_reset(sk, GFP_ATOMIC,
+                                     SK_RST_REASON_NOT_SPECIFIED);
+       tcp_done_with_error(sk, err);

        bh_unlock_sock(sk);
        local_bh_enable();
-       tcp_write_queue_purge(sk);
        if (!has_current_bpf_ctx())
                release_sock(sk);
        return 0;

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help