Re: can/j1939: hung inside rtnl_dellink()
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: 2025-08-22 10:24:08
Also in:
linux-can, lkml
Subsystem:
can network layer, can-j1939 network layer, the rest · Maintainers:
Oliver Hartkopp, Marc Kleine-Budde, Robin van der Gracht, Oleksij Rempel, Linus Torvalds
(Adding netdev ML to ask for hints from different network protocols...) On 2025/08/22 18:51, Oleksij Rempel wrote:
Hello Tetsuo, On Sat, Aug 16, 2025 at 03:51:54PM +0900, Tetsuo Handa wrote:quoted
Hello. I made a minimized C reproducer for unregister_netdevice: waiting for vcan0 to become free. Usage count = 2 problem at https://syzkaller.appspot.com/bug?extid=881d65229ca4f9ae8c84 , and obtained some data using debug printk() patch. It seems that the cause is net/can/j1939/ does not handle NETDEV_UNREGISTER notification while net/can/j1939/ can directly call rtnl_dellink() via sendmsg().Sorry for long delay and than you for your investigation!quoted
The minimized C reproducer is shown below.....quoted
Therefore, I guess that either j1939_netdev_notify() is handling NETDEV_UNREGISTER notification
Oops. I wanted to write
j1939_netdev_notify() is *not* handling NETDEV_UNREGISTER notification
quoted hunk
quoted
or rtnl_dellink() can be called via sendmsg() despite the j1939 socket are in use is wrong. How to fix this problem?I assume the first variant is correct. Can you please test following change:--- a/net/can/j1939/main.c +++ b/net/can/j1939/main.c@@ -370,6 +370,7 @@ goto notify_done; switch (msg) { + case NETDEV_UNREGISTER: case NETDEV_DOWN: j1939_cancel_active_session(priv, NULL); j1939_sk_netdev_event_netdown(priv);
Such change is not sufficient. As far as I tested, the only way that can drop the refcount to 1 is to call j1939_sk_release() (which involves sock_put()) on all j1939 sockets (i.e. something like shown below).
diff --git a/net/can/j1939/j1939-priv.h b/net/can/j1939/j1939-priv.h
index 31a93cae5111..81f58924b4ac 100644
--- a/net/can/j1939/j1939-priv.h
+++ b/net/can/j1939/j1939-priv.h@@ -212,6 +212,7 @@ void j1939_priv_get(struct j1939_priv *priv); /* notify/alert all j1939 sockets bound to ifindex */ void j1939_sk_netdev_event_netdown(struct j1939_priv *priv); +void j1939_sk_netdev_event_unregister(struct j1939_priv *priv); int j1939_cancel_active_session(struct j1939_priv *priv, struct sock *sk); void j1939_tp_init(struct j1939_priv *priv);
diff --git a/net/can/j1939/main.c b/net/can/j1939/main.c
index 7e8a20f2fc42..e568b5928a39 100644
--- a/net/can/j1939/main.c
+++ b/net/can/j1939/main.c@@ -377,6 +377,11 @@ static int j1939_netdev_notify(struct notifier_block *nb, j1939_sk_netdev_event_netdown(priv); j1939_ecu_unmap_all(priv); break; + case NETDEV_UNREGISTER: + pr_info("NETDEV_UNREGISTER notification on %px start\n", ndev); + j1939_sk_netdev_event_unregister(priv); + pr_info("NETDEV_UNREGISTER notification on %px end\n", ndev); + break; } j1939_priv_put(priv);
diff --git a/net/can/j1939/socket.c b/net/can/j1939/socket.c
index 3d8b588822f9..4e53a1b10907 100644
--- a/net/can/j1939/socket.c
+++ b/net/can/j1939/socket.c@@ -1300,6 +1313,24 @@ void j1939_sk_netdev_event_netdown(struct j1939_priv *priv) read_unlock_bh(&priv->j1939_socks_lock); } +void j1939_sk_netdev_event_unregister(struct j1939_priv *priv) +{ + struct j1939_sock *jsk; + struct socket sock = { }; + + rescan: + read_lock_bh(&priv->j1939_socks_lock); + list_for_each_entry(jsk, &priv->j1939_socks, list) { + read_unlock_bh(&priv->j1939_socks_lock); + pr_info("Releasing %px\n", &jsk->sk); + sock.sk = &jsk->sk; + //sock_hold(&jsk->sk); + j1939_sk_release(&sock); + goto rescan; + } + read_unlock_bh(&priv->j1939_socks_lock); +} + static int j1939_sk_no_ioctlcmd(struct socket *sock, unsigned int cmd, unsigned long arg) {
Of course, calling sk_j1939_sk_release() upon NETDEV_UNREGISTER event causes refcount underflow bug. But calling sock_hold() before calling j1939_sk_release() prevents the refcount from dropping to 1. :-( I think we need to somehow make it possible to logically close j1939 sockets without actually closing. Maybe something like "struct in_device"->dead flag which is set by inetdev_destroy() upon NETDEV_UNREGISTER event is needed by j1939 sockets... My build environment is very slow (testing on VMWare on a Windows PC). Running my simplified reproducer on your build environment would be much faster.