Re: [PATCH 2/2] sysctl: lockdep support for sysctl reference counting.

From: Peter Zijlstra <peterz@infradead.org>
Date: 2009-03-31 15:35:33
Also in: lkml

On Tue, 2009-03-31 at 06:40 -0700, Eric W. Biederman wrote:

Peter Zijlstra [off-list ref] writes:

quoted

On Sat, 2009-03-21 at 00:42 -0700, Eric W. Biederman wrote:

quoted

It is possible to get lock ordering deadlocks between locks
and waiting for the sysctl used count to drop to zero.  We have
recently observed one of these in the networking code.

So teach the sysctl code how to speak lockdep so the kernel
can warn about these kinds of rare issues proactively.

It would be very good to extend this changelog with a more detailed
explanation of the deadlock in question.

Let me see if I got it right:

We're holding a lock, while waiting for the refcount to drop to 0.
Dropping that refcount is blocked on that lock.

Something like that?

Exactly.

I must have written an explanation so many times that it got
lost when I wrote that commit message.

In particular the problem can be seen with /proc/sys/net/ipv4/conf/*/forwarding.

The problem is that the handler for forwarding takes the rtnl_lock
with the reference count held.

Then we call unregister_sysctl_table under the rtnl_lock,
which waits for the reference count to go to zero.
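
To make the cycle described above concrete, here is a minimal userspace model of the two paths. It is a sketch only, not kernel code: the struct and every function name in it are hypothetical, and it simply encodes "the handler needs rtnl_lock to finish and drop its reference" against "unregister holds rtnl_lock and needs the reference count to reach zero".

```c
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct sysctl_model {
	bool rtnl_held_by_unregister;	/* unregister path owns rtnl_lock */
	int  used;			/* references held by running handlers */
	bool handler_needs_rtnl;	/* handler must take rtnl_lock to finish */
};

/* A handler can only finish (and drop its reference) if it is not
 * waiting for an rtnl_lock that the unregister path already holds. */
static bool handler_can_finish(const struct sysctl_model *m)
{
	return !(m->handler_needs_rtnl && m->rtnl_held_by_unregister);
}

/* The unregister path can only finish once no references remain. */
static bool unregister_can_finish(const struct sysctl_model *m)
{
	return m->used == 0;
}

/* Deadlock: a reference is outstanding and neither side can progress. */
static bool is_deadlocked(const struct sysctl_model *m)
{
	return m->used > 0 &&
	       !handler_can_finish(m) &&
	       !unregister_can_finish(m);
}
```

With rtnl_held_by_unregister set, used at 1 and handler_needs_rtnl set, is_deadlocked() returns true; clearing any one of the three conditions breaks the cycle.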

quoted

+
+#  define lock_sysctl() __raw_spin_lock(&sysctl_lock.raw_lock)
+#  define unlock_sysctl() __raw_spin_unlock(&sysctl_lock.raw_lock)

Uhmm, Please explain that -- without a proper explanation this is a NAK.

If the refcount is to be considered a lock, sysctl_lock must be considered
the internals of that lock.  lockdep gets extremely confused otherwise.
Since the spinlock is static to this file I'm not especially worried
about it.

Usually lock internal locks still get lockdep coverage. Let's see if we
can find a way for this to be true even here. I suspect the below to
cause the issue:

quoted

 /* called under sysctl_lock, will reacquire if has to wait */

@@ -1478,47 +1531,54 @@ static void start_unregistering(struct ctl_table_header *p)
 	 * if p->used is 0, nobody will ever touch that entry again;
 	 * we'll eliminate all paths to it before dropping sysctl_lock
 	 */
+	table_acquire(p);
 	if (unlikely(p->used)) {
 		struct completion wait;
+		table_contended(p);
+
 		init_completion(&wait);
 		p->unregistering = &wait;
-		spin_unlock(&sysctl_lock);
+		unlock_sysctl();
 		wait_for_completion(&wait);
-		spin_lock(&sysctl_lock);
+		lock_sysctl();
 	} else {
 		/* anything non-NULL; we'll never dereference it */
 		p->unregistering = ERR_PTR(-EINVAL);
 	}
+	table_acquired(p);
+
 	/*
 	 * do not remove from the list until nobody holds it; walking the
 	 * list in do_sysctl() relies on that.
 	 */
 	list_del_init(&p->ctl_entry);
+
+	table_release(p);
 }

There you acquire the table while holding the spinlock, generating:
sysctl_lock -> table_lock, however you then release the sysctl_lock and
re-acquire it, generating table_lock -> sysctl_lock.
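
The inversion described above can be illustrated with a toy lock-order graph in userspace C. This is only a sketch of the general idea behind dependency tracking, not how lockdep is implemented; the graph type, the function names and the class numbering (0 for sysctl_lock, 1 for table_lock) are all assumptions made for the example.

```c
#include <stdbool.h>
#include <string.h>

enum { MAX_CLASSES = 8 };

/* Hypothetical order graph: edge[a][b] means class b was taken while
 * class a was already held. */
struct order_graph {
	bool edge[MAX_CLASSES][MAX_CLASSES];
};

static void graph_init(struct order_graph *g)
{
	memset(g, 0, sizeof(*g));
}

/* Depth-first search: is 'to' reachable from 'from' along recorded edges? */
static bool reachable(const struct order_graph *g, int from, int to,
		      bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < MAX_CLASSES; next++) {
		if (g->edge[from][next] && !seen[next] &&
		    reachable(g, next, to, seen))
			return true;
	}
	return false;
}

/* Record "taken acquired while held is held". Returns false, without
 * recording anything, if that ordering would close a cycle. */
static bool record_order(struct order_graph *g, int held, int taken)
{
	bool seen[MAX_CLASSES] = { false };

	if (reachable(g, taken, held, seen))
		return false;
	g->edge[held][taken] = true;
	return true;
}
```

Recording (0, 1) for sysctl_lock -> table_lock succeeds; recording (1, 0) for table_lock -> sysctl_lock afterwards is rejected, which is the pattern the hunk above would produce if sysctl_lock were a lockdep-tracked lock.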

Humm, can't we write that differently?

quoted

@@ -1951,7 +2011,13 @@ struct ctl_table_header *__register_sysctl_paths(
 		return NULL;
 	}
 #endif
-	spin_lock(&sysctl_lock);
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	{
+		static struct lock_class_key __key;
+		lockdep_init_map(&header->dep_map, "sysctl_used", &__key, 0);
+	}
+#endif	

This means every sysctl thingy gets the same class, is that
intended/desired?

There is only one place we initialize it, and as far as I know really
only one place we take it.  Which is the definition of a lockdep
class as far as I know.

Indeed, just checking.
