Re: [PATCH RFC] v2 expedited "big hammer" RCU grace periods

From: Paul E. McKenney <hidden>
Date: 2009-04-27 16:16:49
Also in: lkml, netfilter-devel

On Mon, Apr 27, 2009 at 11:54:24AM -0400, Mathieu Desnoyers wrote:

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:

On Mon, Apr 27, 2009 at 05:26:39AM +0200, Ingo Molnar wrote:

* Paul E. McKenney [off-list ref] wrote:

On Sun, Apr 26, 2009 at 10:22:55PM +0200, Ingo Molnar wrote:

* Mathieu Desnoyers [off-list ref] wrote:

* Ingo Molnar (mingo@elte.hu) wrote:

* Paul E. McKenney [off-list ref] wrote:

Second cut of "big hammer" expedited RCU grace periods, but only 
for rcu_bh.  This creates another softirq vector, so that entering 
this softirq vector will have forced an rcu_bh quiescent state (as 
noted by Dave Miller).  Use smp_call_function() to invoke 
raise_softirq() on all CPUs in order to cause this to happen.  
Track the CPUs that have passed through a quiescent state (or gone 
offline) with a cpumask.
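
The cpumask bookkeeping described above can be sketched in user space roughly as follows. This is only an illustration, not the code from the patch: struct exp_gp_state, exp_gp_start(), exp_gp_cpu_done(), exp_gp_complete() and NR_CPUS_SKETCH are hypothetical names, a plain 32-bit word stands in for a kernel cpumask, and no locking or atomics are shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS_SKETCH 8	/* hypothetical CPU count for this sketch */

/*
 * Hypothetical stand-in for a kernel cpumask: bit n set means CPU n
 * has not yet passed through a quiescent state (or gone offline).
 */
struct exp_gp_state {
	uint32_t pending;
};

/* Start an expedited grace period: every online CPU is pending. */
static void exp_gp_start(struct exp_gp_state *s, uint32_t online_mask)
{
	s->pending = online_mask;
}

/*
 * Record that a CPU is done, either because it ran the softirq
 * handler or because it went offline.  Clearing an already-clear
 * bit is harmless.
 */
static void exp_gp_cpu_done(struct exp_gp_state *s, unsigned int cpu)
{
	assert(cpu < NR_CPUS_SKETCH);
	s->pending &= ~(UINT32_C(1) << cpu);
}

/* The grace period is over once no CPU remains pending. */
static bool exp_gp_complete(const struct exp_gp_state *s)
{
	return s->pending == 0;
}
```

A caller would invoke exp_gp_start() with the set of online CPUs, kick each of them, and then wait until exp_gp_complete() returns true. Real code would need atomic bit operations and memory barriers, which this sketch leaves out.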

hm, i'm still asking whether doing this would be simpler via a 
reschedule vector - which not only is an existing facility but also 
forces all RCU domains through a quiescent state - not just bh-RCU 
participants.

Triggering a new softirq is in no way simpler than doing an SMP 
cross-call - in fact softirqs are a finite resource so using some 
other facility would be preferred.

Am i missing something?

I think the reason for this whole thread is that waiting for an rcu 
quiescent state, when called many times e.g. in multiple iptables 
invocations, takes too long (5 seconds to load the netfilter 
rules at boot). [...]

I'm aware of the problem space.

I was suggesting that to trigger the quiescent state and to wait for 
it to propagate it would be enough to reuse the reschedule 
mechanism.

It would be relatively straightforward: first a send-reschedule then 
do a wait_task_context_switch() on rq->curr - both are existing 
primitives. (a task reference has to be taken but that's pretty much 
all)

Well, one reason I didn't take this approach was that I didn't 
happen to think of it.  ;-)

Also that I hadn't heard of wait_task_context_switch().

Hmmm...  Looking for wait_task_context_switch().  OK, found it.

It looks to me that this primitive won't return until the 
scheduler actually decides to run something else.  We instead need 
to have something that stops waiting once the CPU enters the 
scheduler, hence the previous thought of making rcu_qsctr_inc() do 
a bit of extra work.

This would be a way of making an expedited RCU-sched across all 
RCU implementations.  As noted in the earlier email, it would not 
handle RCU or RCU-bh in a -rt kernel.

By the time wait_task_context_switch() returns from the last CPU 
we know that the quiescent state has passed.

We would want to wait for all of the CPUs in parallel, though, 
wouldn't we?  Seems that we would not want to wait for the last 
CPU to do another trip through the scheduler if it had already 
passed through the scheduler while we were waiting on the earlier 
CPUs.

So it seems like we would still want a two-pass approach -- one 
pass to capture the current state, the second pass to wait for the 
state to change.
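
The two-pass idea can be sketched in user space as below. This is an assumption-laden illustration rather than kernel code: qs_count[], sketch_qs_inc(), sketch_snapshot(), sketch_all_advanced() and SKETCH_NR_CPUS are hypothetical names, the per-CPU counter only loosely mirrors what rcu_qsctr_inc() would maintain, and all concurrency control is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NR_CPUS 4	/* hypothetical CPU count for this sketch */

/*
 * Hypothetical per-CPU counters, bumped each time a CPU passes
 * through the scheduler (in the spirit of rcu_qsctr_inc()).
 */
static unsigned long qs_count[SKETCH_NR_CPUS];

/* Note a quiescent state on the given CPU. */
static void sketch_qs_inc(size_t cpu)
{
	assert(cpu < SKETCH_NR_CPUS);
	qs_count[cpu]++;
}

/* Pass 1: capture the current counter value of every CPU. */
static void sketch_snapshot(unsigned long snap[SKETCH_NR_CPUS])
{
	size_t cpu;

	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
		snap[cpu] = qs_count[cpu];
}

/*
 * Pass 2 check: true once every CPU's counter differs from its
 * snapshot.  A CPU that advanced early is never waited on again.
 */
static bool sketch_all_advanced(const unsigned long snap[SKETCH_NR_CPUS])
{
	size_t cpu;

	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
		if (qs_count[cpu] == snap[cpu])
			return false;
	return true;
}
```

The waiter would call sketch_snapshot() once, kick the CPUs, and then re-check sketch_all_advanced() until it returns true. Because the comparison is against the snapshot, the order in which CPUs pass through the scheduler does not matter.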

I think waiting in parallel is still possible (first kick all tasks, 
then make sure all tasks have left the CPU at least once).

The busy-waiting in wait_task_context_switch() is indeed a problem - 
but perhaps that could be refactored to be a migration-thread driven 
wait_for_completion() + complete() cycle? It could be driven by 
preempt notifiers perhaps - and become zero-cost.

Hmmm...  It would need to be informed of the quiescent state even if
that quiescent state did not result in a preemption.

But you are right -- I do need to expedite RCU, not just RCU-bh,
especially given that the boot-speed guys are starting to see grace
periods as a measurable fraction of the boot time.  I will take another
pass at this.

It might sound a bit simplistic, but... scheduling a high-priority
workqueue on every CPU would give you the guarantees you seem to need
here. Or is the delay of letting the scheduler schedule a high-priority
task a delay you are trying to avoid?

Some kind of priority boosting done by synchronize_rcu() could probably
work, and you could support rcu callbacks priority boosting by assigning
a priority to each callback registered (same priority as the thread
which invoked call_rcu). The rcu callbacks could then be sorted by
priority in an RB tree, and only the callbacks with priority >= that of
the next priority task would be executed.
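
A rough user-space sketch of the priority-ordered callback idea follows. It swaps in a small sorted array where the suggestion above names an RB tree, and it assumes that a larger number means a higher priority. All names here (struct sketch_cb, struct sketch_cb_list, sketch_cb_add(), sketch_cb_run(), sketch_count_cb()) are hypothetical and are not part of any RCU implementation.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_CBS 16	/* hypothetical capacity */

struct sketch_cb {
	int prio;			/* assumed: larger means higher priority */
	void (*func)(void *arg);
	void *arg;
};

struct sketch_cb_list {
	struct sketch_cb cbs[SKETCH_MAX_CBS];	/* kept in descending prio order */
	size_t n;
};

/* Example callback: increments the int that arg points to. */
static void sketch_count_cb(void *arg)
{
	(*(int *)arg)++;
}

/* Insert a callback, keeping the array sorted.  Returns -1 if full. */
static int sketch_cb_add(struct sketch_cb_list *l, int prio,
			 void (*func)(void *), void *arg)
{
	size_t i;

	assert(func != NULL);
	if (l->n == SKETCH_MAX_CBS)
		return -1;
	/* Shift lower-priority entries up by one slot. */
	for (i = l->n; i > 0 && l->cbs[i - 1].prio < prio; i--)
		l->cbs[i] = l->cbs[i - 1];
	l->cbs[i].prio = prio;
	l->cbs[i].func = func;
	l->cbs[i].arg = arg;
	l->n++;
	return 0;
}

/*
 * Invoke and remove every callback whose priority is >= that of the
 * next task to run.  Returns the number of callbacks invoked.
 */
static size_t sketch_cb_run(struct sketch_cb_list *l, int next_task_prio)
{
	size_t ran = 0, i;

	while (ran < l->n && l->cbs[ran].prio >= next_task_prio) {
		l->cbs[ran].func(l->cbs[ran].arg);
		ran++;
	}
	/* Compact the remaining lower-priority callbacks to the front. */
	for (i = ran; i < l->n; i++)
		l->cbs[i - ran] = l->cbs[i];
	l->n -= ran;
	return ran;
}
```

With callbacks queued at priorities 1, 5 and 3, a call to sketch_cb_run() with a next-task priority of 3 invokes the two callbacks at 5 and 3 and leaves the one at priority 1 queued.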

I did something similar for the implementation of synchronize_sched()
in preemptable RCU.  The interactions with CPU hotplug are a bit ugly.
It will be easier to hook into rcu_qsctr_inc().  ;-)

But this discussion has been quite useful -- my thoughts for the design
of the long-term solution were a bit lacking, as they would have allowed
a heavy callback load to delay an expedited grace period.  So this will
be a bit of a hack until I get all the RCU implementations converged,
but there is a nice long-term solution to be had by integrating the
expediting into the hierarchical-RCU data structures.

							Thanx, Paul
