Re: [PATCH v6 1/6] Bluetooth: schedule SCO timeouts with delayed_work
From: Desmond Cheong Zhi Xi <hidden>
Date: 2021-09-02 22:54:38
Also in:
linux-bluetooth, linux-kernel-mentees, lkml
Subsystem:
bluetooth subsystem, the rest · Maintainers:
Marcel Holtmann, Luiz Augusto von Dentz, Linus Torvalds
On 2/9/21 5:41 pm, Eric Dumazet wrote:
quoted hunk ↗ jump to hunk
On 9/2/21 12:32 PM, Desmond Cheong Zhi Xi wrote:quoted
Hi Eric, This actually seems to be a pre-existing error in sco_sock_connect that we now hit in sco_sock_timeout. Any thoughts on the following patch to address the problem? Link: https://lore.kernel.org/lkml/20210831065601.101185-1-desmondcheongzx@gmail.com/ (local)syzbot is still working on finding a repro, this is obviously not trivial, because this is a race window. I think this can happen even with a single SCO connection. This might be triggered more easily forcing a delay in sco_sock_timeout()diff --git a/net/bluetooth/sco.c b/net/bluetooth/sco.c index 98a88158651281c9f75c4e0371044251e976e7ef..71ebe0243fab106c676c308724fe3a3f92a62cbd 100644 --- a/net/bluetooth/sco.c +++ b/net/bluetooth/sco.c@@ -84,8 +84,14 @@ static void sco_sock_timeout(struct work_struct *work) sco_conn_lock(conn); sk = conn->sk; - if (sk) + if (sk) { + // lets pretend cpu has been busy (in interrupts) for 100ms + int i; + for (i=0;i<100000;i++) + udelay(1); + sock_hold(sk); + }> sco_conn_unlock(conn); if (!sk)Stack trace tells us that sco_sock_timeout() is running after last reference on socket has been released. __refcount_add include/linux/refcount.h:199 [inline] __refcount_inc include/linux/refcount.h:250 [inline] refcount_inc include/linux/refcount.h:267 [inline] sock_hold include/net/sock.h:702 [inline] sco_sock_timeout+0x216/0x290 net/bluetooth/sco.c:88 process_one_work+0x98d/0x1630 kernel/workqueue.c:2276 worker_thread+0x658/0x11f0 kernel/workqueue.c:2422 kthread+0x3e5/0x4d0 kernel/kthread.c:319 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 This is why I suggested to delay sock_put() to make sure this can not happen.diff --git a/net/bluetooth/sco.c b/net/bluetooth/sco.c index 98a88158651281c9f75c4e0371044251e976e7ef..bd0222e3f05a6bcb40cffe8405c9dfff98d7afde 100644 --- a/net/bluetooth/sco.c +++ b/net/bluetooth/sco.c@@ -195,10 +195,11 @@ static void sco_conn_del(struct hci_conn *hcon, int err) sco_sock_clear_timer(sk); sco_chan_del(sk, err); release_sock(sk); - sock_put(sk); /* Ensure no more work items will run before freeing conn. */ cancel_delayed_work_sync(&conn->timeout_work); + + sock_put(sk); } hcon->sco_data = NULL;
I see where you're going with this, but once sco_chan_del returns, any instance of sco_sock_timeout that hasn't yet called sock_hold will simply return, because conn->sk is NULL. Adding a delay to the sco_conn_lock critical section in sco_sock_timeout would not affect this because sco_chan_del clears conn->sk while holding onto the lock. The main reason that cancel_delayed_work_sync is run there is to make sure that we don't have a UAF on the SCO connection itself after we free conn. For a single SCO connection with well-formed channel, I think there can't be a race. Here's the reasoning: - For the timeout to be scheduled, a socket must have a channel with a connection. - When a channel between a socket and connection is established, the socket transitions from BT_OPEN to BT_CONNECTED, BT_CONNECT, or BT_CONNECT2. - For a socket to be released, it has to be zapped. For sockets that have a state of BT_CONNECTED, BT_CONNECT, or BT_CONNECT2, they are zapped only when the channel is deleted. - If the channel is deleted (which is protected by sco_conn_lock), then conn->sk is NULL, and sco_sock_timeout simply exits. If we had entered the critical section in sco_sock_timeout before the channel was deleted, then we increased the reference count on the socket, so it won't be freed until sco_sock_timeout is done. Hence, sco_sock_timeout doesn't race with the release of a socket that has a well-formed channel with a connection. But if multiple connections are allocated and overwritten in sco_sock_connect, then none of the above assumptions hold because the SCO connection can't be cleaned up (i.e. conn->sk cannot be set to NULL) when the associated socket is released. This scenario happens in the syzbot reproducer for the crash here: https://syzkaller.appspot.com/bug?id=bcc246d137428d00ed14b476c2068579515fe2bc That aside, upon taking a closer look, I think there is indeed a race lurking in sco_conn_del, but it's not the one that syzbot is hitting. Our sock_hold simply comes too late, and by the time it's called we might have already have freed the socket. So probably something like this needs to happen:
diff --git a/net/bluetooth/sco.c b/net/bluetooth/sco.c
index fa25b07120c9..ea18e5b56343 100644
--- a/net/bluetooth/sco.c
+++ b/net/bluetooth/sco.c@@ -187,10 +187,11 @@ static void sco_conn_del(struct hci_conn *hcon, int err) /* Kill socket */ sco_conn_lock(conn); sk = conn->sk; + if (sk) + sock_hold(sk); sco_conn_unlock(conn); if (sk) { - sock_hold(sk); lock_sock(sk); sco_sock_clear_timer(sk); sco_chan_del(sk, err);