Re: rtnl_lock deadlock on 3.10
From: Steve Wise <hidden>
Date: 2013-09-05 15:14:46
Also in:
linux-rdma
On 9/5/2013 5:02 AM, Bart Van Assche wrote:
On 07/30/13 14:54, Steve Wise wrote:
On 7/29/2013 6:02 PM, Shawn Bohrer wrote:
On Mon, Jul 15, 2013 at 09:38:19AM -0500, Shawn Bohrer wrote:
On Wed, Jul 03, 2013 at 08:26:11PM +0300, Or Gerlitz wrote:
On 03/07/2013 20:22, Shawn Bohrer wrote:
On Wed, Jul 03, 2013 at 07:33:07AM +0200, Hannes Frederic Sowa wrote:
On Wed, Jul 03, 2013 at 07:11:52AM +0200, Hannes Frederic Sowa wrote:
On Tue, Jul 02, 2013 at 01:38:26PM +0000, Cong Wang wrote:
On Tue, 02 Jul 2013 at 08:28 GMT, Hannes Frederic Sowa [off-list ref] wrote:
On Mon, Jul 01, 2013 at 09:54:56AM -0500, Shawn Bohrer wrote:
I've managed to hit a deadlock at boot a couple times while testing the 3.10 rc kernels. It seems to always happen when my network devices are initializing. This morning I updated to v3.10 and made a few config tweaks and so far I've hit it 4 out of 5 reboots. It looks like most processes are getting stuck on rtnl_lock. Below is a boot log with the soft lockup prints. Please let me know if there is any other information I can provide:

Could you try a build with CONFIG_LOCKDEP enabled?

The problem is clear: ib_register_device() is called with rtnl_lock held, but itself needs device_mutex. However, ib_register_client() first acquires device_mutex, then indirectly calls register_netdev(), which takes rtnl_lock. Deadlock!

One possible fix is always taking rtnl_lock before taking device_mutex, something like below:

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 18c1ece..890870b 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -381,6 +381,7 @@ int ib_register_client(struct ib_client *client)
 {
 	struct ib_device *device;
 
+	rtnl_lock();
 	mutex_lock(&device_mutex);
 
 	list_add_tail(&client->list, &client_list);
@@ -389,6 +390,7 @@ int ib_register_client(struct ib_client *client)
 			client->add(device);
 
 	mutex_unlock(&device_mutex);
+	rtnl_unlock();
 
 	return 0;
 }
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index b6e049a..5a7a048 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1609,7 +1609,7 @@ static struct net_device *ipoib_add_port(const char *format,
 		goto event_failed;
 	}
 
-	result = register_netdev(priv->dev);
+	result = register_netdevice(priv->dev);
 	if (result) {
 		printk(KERN_WARNING "%s: couldn't register ipoib port %d; error %d\n",
 		       hca->name, port, result);

Looks good to me. Shawn, could you test this patch?

ib_unregister_device/ib_unregister_client would need the same change, too. I have not checked the other ->add() and ->remove() functions. Also cc'ed linux-rdma@vger.kernel.org, Roland Dreier.

Cong's patch is missing the #include <linux/rtnetlink.h> but otherwise I've had 34 successful reboots with no deadlocks, which is a good sign. It sounds like there are more paths that need to be audited and a proper patch submitted. I can do more testing later if needed.

Thanks,
Shawn

Guys, I was a bit busy today looking into that, but I don't think we want the IB core layer (core/device.c) to use rtnl locking, which is something that belongs to the network stack.

Has any more thought been put into a proper fix for this issue?

I'm no expert in this area but I'm having a hard time seeing a different solution than the one Cong suggested. Just to be clear, the deadlock I hit was between cxgb3 and the ipoib module, so I've Cc'd Steve Wise in case he has a better solution from the Chelsio side.

I don't know of another way to resolve this. The rtnl lock is used in ipoib and mlx4 already. I think we should go forward with the proposed patch.

(replying to an e-mail of one month ago)

Hello,

It would be appreciated if anyone could report what the current status of this issue is. I think a deadlock I ran into with kernels 3.10 and 3.11 and PCI pass-through is related to this issue. See also http://bugzilla.kernel.org/show_bug.cgi?id=60856 for the lockdep report.

Thanks,
Bart.
Roland, what do you think? As I've said, I think we should go ahead with using the rtnl lock in the core. Is there a complete patch available for review? It looks like the original was only a partial fix.

Steve.
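For anyone reading this thread later, the lock-ordering rule behind the proposed patch (always take rtnl_lock before device_mutex) can be illustrated with a minimal userspace sketch. This is not kernel code: fake_rtnl, fake_device_mutex, the two thread functions, and run_ordered_demo() are hypothetical stand-ins built on pthreads, assumed here purely to show the idea that two paths which acquire the same pair of locks in the same order cannot block each other in an ABBA cycle.

```c
#include <assert.h>
#include <pthread.h>

#define ITERATIONS 100000

/* Hypothetical userspace stand-ins for rtnl_lock and device_mutex. */
static pthread_mutex_t fake_rtnl = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t fake_device_mutex = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter;

/* Every path takes fake_rtnl first and fake_device_mutex second. */
static void locked_increment(void)
{
	pthread_mutex_lock(&fake_rtnl);
	pthread_mutex_lock(&fake_device_mutex);
	shared_counter++;
	pthread_mutex_unlock(&fake_device_mutex);
	pthread_mutex_unlock(&fake_rtnl);
}

/* Stands in for the path that reaches ib_register_device() under rtnl. */
static void *register_device_path(void *arg)
{
	(void)arg;
	for (int i = 0; i < ITERATIONS; i++)
		locked_increment();
	return NULL;
}

/* Stands in for ib_register_client() after the proposed reordering. */
static void *register_client_path(void *arg)
{
	(void)arg;
	for (int i = 0; i < ITERATIONS; i++)
		locked_increment();
	return NULL;
}

/* Runs both paths concurrently and returns the final counter value. */
long run_ordered_demo(void)
{
	pthread_t a, b;
	int rc;

	shared_counter = 0;
	rc = pthread_create(&a, NULL, register_device_path, NULL);
	assert(rc == 0);
	rc = pthread_create(&b, NULL, register_client_path, NULL);
	assert(rc == 0);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return shared_counter;
}
```

Calling run_ordered_demo() from a small main() and building with -pthread should return once both threads finish. If one of the two paths took the locks in the opposite order, the same program could hang, which is the situation described at the start of this thread.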