Thread (15 messages) 15 messages, 2 authors, 2025-09-19

Re: [PATCH net-next v3 3/3] inet: Avoid ehash lookup race in inet_twsk_hashdance_schedule()

From: luoxuanqiang <hidden>
Date: 2025-09-18 08:33:13

在 2025/9/17 12:36, Kuniyuki Iwashima 写道:
On Tue, Sep 16, 2025 at 8:27 PM luoxuanqiang [off-list ref] wrote:
quoted
在 2025/9/17 03:48, Kuniyuki Iwashima 写道:
quoted
On Tue, Sep 16, 2025 at 3:31 AM [off-list ref] wrote:
quoted
From: Xuanqiang Luo <redacted>

Since ehash lookups are lockless, if another CPU is converting sk to tw
concurrently, fetching the newly inserted tw with tw->tw_refcnt == 0 cause
lookup failure.

The call trace map is drawn as follows:
     CPU 0                                CPU 1
     -----                                -----
                                       inet_twsk_hashdance_schedule()
                                       spin_lock()
                                       inet_twsk_add_node_rcu(tw, ...)
__inet_lookup_established()
(find tw, failure due to tw_refcnt = 0)
                                       __sk_nulls_del_node_init_rcu(sk)
                                       refcount_set(&tw->tw_refcnt, 3)
                                       spin_unlock()

By replacing sk with tw atomically via hlist_nulls_replace_init_rcu() after
setting tw_refcnt, we ensure that tw is either fully initialized or not
visible to other CPUs, eliminating the race.

Fixes: 3ab5aee7fe84 ("net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls")
Signed-off-by: Xuanqiang Luo <redacted>
---
   net/ipv4/inet_timewait_sock.c | 15 ++++++---------
   1 file changed, 6 insertions(+), 9 deletions(-)
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 5b5426b8ee92..1ba20c4cb73b 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -116,7 +116,7 @@ void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
          spinlock_t *lock = inet_ehash_lockp(hashinfo, sk->sk_hash);
          struct inet_bind_hashbucket *bhead, *bhead2;

-       /* Step 1: Put TW into bind hash. Original socket stays there too.
+       /* Put TW into bind hash. Original socket stays there too.
             Note, that any socket with inet->num != 0 MUST be bound in
             binding cache, even if it is closed.
           */
@@ -140,14 +140,6 @@ void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,

          spin_lock(lock);

-       /* Step 2: Hash TW into tcp ehash chain */
-       inet_twsk_add_node_rcu(tw, &ehead->chain);
-
-       /* Step 3: Remove SK from hash chain */
-       if (__sk_nulls_del_node_init_rcu(sk))
-               sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
-
-
          /* Ensure above writes are committed into memory before updating the
           * refcount.
           * Provides ordering vs later refcount_inc().
@@ -162,6 +154,11 @@ void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
           */
          refcount_set(&tw->tw_refcnt, 3);

+       if (hlist_nulls_replace_init_rcu(&sk->sk_nulls_node, &tw->tw_node))
+               sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
+       else
+               inet_twsk_add_node_rcu(tw, &ehead->chain);
When hlist_nulls_replace_init_rcu() returns false ?
When hlist_nulls_replace_init_rcu() returns false, it means
sk is unhashed,
and how does this happen ?

Here is under lock_sock() I think, for example, you can
find a lockdep annotation in the path:

tcp_time_wait_init
   tp->af_specific->md5_lookup / tcp_v4_md5_lookup
     tcp_md5_do_lookup
       __tcp_md5_do_lookup
         rcu_dereference_check(tp->md5sig_info, lockdep_sock_is_held(sk));

So, is there a path that unhashes socket without holding
lock_sock() ?
I'm not entirely sure about this point yet, because
inet_unhash() is called in too many places and uses
__sk_nulls_del_node_init_rcu() to unhash sockets without
explicitly requiring bh_lock_sock().

Until I can verify this, I'll keep the original check
for old socket unhashed state to ensure safety.

It would be great if you could confirm this behavior.

Thanks
Xuanqiang.
quoted
the replacement operation failed, we need
to insert tw, and this doesn't change the original logic.
quoted
quoted
+
          inet_twsk_schedule(tw, timeo);

          spin_unlock(lock);
--
2.25.1
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help