Re: [PATCH v2 net 01/15] af_unix: Set sk->sk_state under unix_state_lock() for truly disconencted peer.
From: Kuniyuki Iwashima <hidden>
Date: 2024-06-09 19:53:41
Subsystem:
bpf [l7 framework] (sockmap), networking [general], networking [sockets], networking [unix sockets], the rest · Maintainers:
John Fastabend, Jakub Sitnicki, "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni, Kuniyuki Iwashima, Willem de Bruijn, Linus Torvalds
From: Michal Luczaj <redacted> Date: Sun, 9 Jun 2024 13:28:34 +0200
quoted hunk ↗ jump to hunk
On 6/4/24 18:52, Kuniyuki Iwashima wrote:quoted
When a SOCK_DGRAM socket connect()s to another socket, the both sockets' sk->sk_state are changed to TCP_ESTABLISHED so that we can register them to BPF SOCKMAP. (...)Speaking of af_unix and sockmap, SOCK_STREAM has a tiny window for bpf(BPF_MAP_UPDATE_ELEM) and unix_stream_connect() to race: when sock_map_sk_state_allowed() passes (sk_state == TCP_ESTABLISHED), but unix_peer(sk) in unix_stream_bpf_update_proto() _still_ returns NULL: T0 bpf T1 connect ====== ========== WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED) sock_map_sk_state_allowed(sk) ... sk_pair = unix_peer(sk) sock_hold(sk_pair) sock_hold(newsk) smp_mb__after_atomic() unix_peer(sk) = newsk unix_state_unlock(sk) With mdelay(1) stuffed in unix_stream_connect(): [ 902.277593] BUG: kernel NULL pointer dereference, address: 0000000000000080 [ 902.277633] #PF: supervisor write access in kernel mode [ 902.277661] #PF: error_code(0x0002) - not-present page [ 902.277688] PGD 107191067 P4D 107191067 PUD 10f63c067 PMD 0 [ 902.277716] Oops: Oops: 0002 [#23] PREEMPT SMP NOPTI [ 902.277742] CPU: 2 PID: 1505 Comm: a.out Tainted: G D 6.10.0-rc1+ #130 [ 902.277769] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 [ 902.277793] RIP: 0010:unix_stream_bpf_update_proto+0xa1/0x150 Setting TCP_ESTABLISHED _after_ unix_peer() fixes the issue, so how about something like@@ -1631,12 +1631,13 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr, /* Set credentials */ copy_peercred(sk, other); - sock->state = SS_CONNECTED; - WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED); sock_hold(newsk); + smp_mb__after_atomic(); /* sock_hold() does an atomic_inc() */ + WRITE_ONCE(unix_peer(sk), newsk); + smp_wmb(); /* ensure peer is set before sk_state */ - smp_mb__after_atomic(); /* sock_hold() does an atomic_inc() */ - unix_peer(sk) = newsk; + sock->state = SS_CONNECTED; + WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED); unix_state_unlock(sk);@@ -180,7 +180,8 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r * be a single matching destroy operation. */ if (!psock->sk_pair) { - sk_pair = unix_peer(sk); + smp_rmb(); + sk_pair = READ_ONCE(unix_peer(sk)); sock_hold(sk_pair); psock->sk_pair = sk_pair; }This should keep things ordered and lockless... I hope.
sock_map_update_elem() assumes that the socket is protected by lock_sock(), but AF_UNIX uses it only for the general path. So, I think we should fix sock_map_sk_state_allowed() and then use smp_store_release()/smp_load_acquire() rather than smp_[rw]mb() for unix_peer(sk). Could you test this with the mdelay(1) change ? Note that we need not touch sock->state. I have a patch for net-next that removes sock->state uses completely from AF_UNIX as we don't use it. Even unix_seq_show() depends on sk->sk_state. ---8<---
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index d3dbb92153f2..67794d2c7498 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c@@ -549,7 +549,7 @@ static bool sock_map_sk_state_allowed(const struct sock *sk) if (sk_is_tcp(sk)) return (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_LISTEN); if (sk_is_stream_unix(sk)) - return (1 << sk->sk_state) & TCPF_ESTABLISHED; + return (1 << READ_ONCE(sk->sk_state)) & TCPF_ESTABLISHED; return true; }
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 80846279de9f..a558745c7d76 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c@@ -1632,11 +1632,11 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr, copy_peercred(sk, other); sock->state = SS_CONNECTED; - WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED); sock_hold(newsk); smp_mb__after_atomic(); /* sock_hold() does an atomic_inc() */ - unix_peer(sk) = newsk; + smp_store_release(&unix_peer(sk), newsk); + WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED); unix_state_unlock(sk);
diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
index bd84785bf8d6..6d9ae8e63901 100644
--- a/net/unix/unix_bpf.c
+++ b/net/unix/unix_bpf.c@@ -180,7 +180,7 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r * be a single matching destroy operation. */ if (!psock->sk_pair) { - sk_pair = unix_peer(sk); + sk_pair = smp_load_acquire(&unix_peer(sk)); sock_hold(sk_pair); psock->sk_pair = sk_pair; } ---8<---
quoted hunk ↗ jump to hunk
Alternatively, maybe it would be better just to make BPF respect the unix state lock?@@ -180,6 +180,8 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r * be a single matching destroy operation. */ if (!psock->sk_pair) { + unix_state_lock(sk); sk_pair = unix_peer(sk); + unix_state_unlock(sk); sock_hold(sk_pair); psock->sk_pair = sk_pair;What do you think?
If we'd go this way, I'd change like this: ---8<---
diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
index bd84785bf8d6..1db42cfee70d 100644
--- a/net/unix/unix_bpf.c
+++ b/net/unix/unix_bpf.c@@ -159,8 +159,6 @@ int unix_dgram_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool re int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore) { - struct sock *sk_pair; - /* Restore does not decrement the sk_pair reference yet because we must * keep the a reference to the socket until after an RCU grace period * and any pending sends have completed.
@@ -180,9 +178,9 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r * be a single matching destroy operation. */ if (!psock->sk_pair) { - sk_pair = unix_peer(sk); - sock_hold(sk_pair); - psock->sk_pair = sk_pair; + psock->sk_pair = unix_peer_get(sk); + if (WARN_ON_ONCE(!psock->sk_pair)) + return -EINVAL; } unix_stream_bpf_check_needs_rebuild(psock->sk_proto); ---8<---
And the _last_ option would be..., no :) ---8<---
diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b6eedf7650da..c7e31bc3e95e 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h@@ -94,8 +94,8 @@ struct unix_sock { #define unix_sk(ptr) container_of_const(ptr, struct unix_sock, sk) #define unix_peer(sk) (unix_sk(sk)->peer) -#define unix_state_lock(s) spin_lock(&unix_sk(s)->lock) -#define unix_state_unlock(s) spin_unlock(&unix_sk(s)->lock) +#define unix_state_lock(s) lock_sock(s) +#define unix_state_unlock(s) release_sock(s) enum unix_socket_lock_class { U_LOCK_NORMAL, U_LOCK_SECOND, /* for double locking, see unix_state_double_lock(). */
@@ -108,7 +108,7 @@ enum unix_socket_lock_class { static inline void unix_state_lock_nested(struct sock *sk, enum unix_socket_lock_class subclass) { - spin_lock_nested(&unix_sk(sk)->lock, subclass); + lock_sock_nested(sk, subclass); } #define peer_wait peer_wq.wait ---8<---