Re: [PATCH net-next v3 3/3] inet: Avoid ehash lookup race in inet_twsk_hashdance_schedule()
From: luoxuanqiang <hidden>
Date: 2025-09-18 08:33:13
在 2025/9/17 12:36, Kuniyuki Iwashima 写道:
On Tue, Sep 16, 2025 at 8:27 PM luoxuanqiang [off-list ref] wrote:quoted
在 2025/9/17 03:48, Kuniyuki Iwashima 写道:quoted
On Tue, Sep 16, 2025 at 3:31 AM [off-list ref] wrote:quoted
From: Xuanqiang Luo <redacted> Since ehash lookups are lockless, if another CPU is converting sk to tw concurrently, fetching the newly inserted tw with tw->tw_refcnt == 0 cause lookup failure. The call trace map is drawn as follows: CPU 0 CPU 1 ----- ----- inet_twsk_hashdance_schedule() spin_lock() inet_twsk_add_node_rcu(tw, ...) __inet_lookup_established() (find tw, failure due to tw_refcnt = 0) __sk_nulls_del_node_init_rcu(sk) refcount_set(&tw->tw_refcnt, 3) spin_unlock() By replacing sk with tw atomically via hlist_nulls_replace_init_rcu() after setting tw_refcnt, we ensure that tw is either fully initialized or not visible to other CPUs, eliminating the race. Fixes: 3ab5aee7fe84 ("net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls") Signed-off-by: Xuanqiang Luo <redacted> --- net/ipv4/inet_timewait_sock.c | 15 ++++++--------- 1 file changed, 6 insertions(+), 9 deletions(-)diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c index 5b5426b8ee92..1ba20c4cb73b 100644 --- a/net/ipv4/inet_timewait_sock.c +++ b/net/ipv4/inet_timewait_sock.c@@ -116,7 +116,7 @@ void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw, spinlock_t *lock = inet_ehash_lockp(hashinfo, sk->sk_hash); struct inet_bind_hashbucket *bhead, *bhead2; - /* Step 1: Put TW into bind hash. Original socket stays there too. + /* Put TW into bind hash. Original socket stays there too. Note, that any socket with inet->num != 0 MUST be bound in binding cache, even if it is closed. */@@ -140,14 +140,6 @@ void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw, spin_lock(lock); - /* Step 2: Hash TW into tcp ehash chain */ - inet_twsk_add_node_rcu(tw, &ehead->chain); - - /* Step 3: Remove SK from hash chain */ - if (__sk_nulls_del_node_init_rcu(sk)) - sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1); - - /* Ensure above writes are committed into memory before updating the * refcount. * Provides ordering vs later refcount_inc().@@ -162,6 +154,11 @@ void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw, */ refcount_set(&tw->tw_refcnt, 3); + if (hlist_nulls_replace_init_rcu(&sk->sk_nulls_node, &tw->tw_node)) + sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1); + else + inet_twsk_add_node_rcu(tw, &ehead->chain);When hlist_nulls_replace_init_rcu() returns false ?When hlist_nulls_replace_init_rcu() returns false, it means sk is unhashed,and how does this happen ? Here is under lock_sock() I think, for example, you can find a lockdep annotation in the path: tcp_time_wait_init tp->af_specific->md5_lookup / tcp_v4_md5_lookup tcp_md5_do_lookup __tcp_md5_do_lookup rcu_dereference_check(tp->md5sig_info, lockdep_sock_is_held(sk)); So, is there a path that unhashes socket without holding lock_sock() ?
I'm not entirely sure about this point yet, because inet_unhash() is called in too many places and uses __sk_nulls_del_node_init_rcu() to unhash sockets without explicitly requiring bh_lock_sock(). Until I can verify this, I'll keep the original check for old socket unhashed state to ensure safety. It would be great if you could confirm this behavior. Thanks Xuanqiang.
quoted
the replacement operation failed, we need to insert tw, and this doesn't change the original logic.quoted
quoted
+ inet_twsk_schedule(tw, timeo); spin_unlock(lock); -- 2.25.1