Re: [PATCH RFC] v2 expedited "big hammer" RCU grace periods
From: Paul E. McKenney <hidden>
Date: 2009-04-27 16:16:49
Also in:
lkml, netfilter-devel
On Mon, Apr 27, 2009 at 11:54:24AM -0400, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Mon, Apr 27, 2009 at 05:26:39AM +0200, Ingo Molnar wrote:
> > > * Paul E. McKenney [off-list ref] wrote:
> > > > On Sun, Apr 26, 2009 at 10:22:55PM +0200, Ingo Molnar wrote:
> > > > > * Mathieu Desnoyers [off-list ref] wrote:
> > > > > > * Ingo Molnar (mingo@elte.hu) wrote:
> > > > > > > * Paul E. McKenney [off-list ref] wrote:
> > > > > > > > Second cut of "big hammer" expedited RCU grace periods, but
> > > > > > > > only for rcu_bh. This creates another softirq vector, so
> > > > > > > > that entering this softirq vector will have forced an
> > > > > > > > rcu_bh quiescent state (as noted by Dave Miller). Use
> > > > > > > > smp_call_function() to invoke raise_softirq() on all CPUs
> > > > > > > > in order to cause this to happen. Track the CPUs that have
> > > > > > > > passed through a quiescent state (or gone offline) with a
> > > > > > > > cpumask.
> > > > > > >
> > > > > > > hm, i'm still asking whether doing this would be simpler via
> > > > > > > a reschedule vector - which not only is an existing facility
> > > > > > > but also forces all RCU domains through a quiescent state -
> > > > > > > not just bh-RCU participants.
> > > > > > >
> > > > > > > Triggering a new softirq is in no way simpler than doing an
> > > > > > > SMP cross-call - in fact softirqs are a finite resource so
> > > > > > > using some other facility would be preferred.
> > > > > > >
> > > > > > > Am i missing something?
> > > > > >
> > > > > > I think the reason for this whole thread is that waiting for
> > > > > > rcu quiescent state, when called many times e.g. in multiple
> > > > > > iptables invocations, takes too long (5 seconds to load the
> > > > > > netfilter rules at boot). [...]
> > > > >
> > > > > I'm aware of the problem space.
> > > > >
> > > > > I was suggesting that to trigger the quiescent state and to wait
> > > > > for it to propagate it would be enough to reuse the reschedule
> > > > > mechanism.
> > > > >
> > > > > It would be relatively straightforward: first a send-reschedule
> > > > > then do a wait_task_context_switch() on rq->curr - both are
> > > > > existing primitives. (a task reference has to be taken but
> > > > > that's pretty much all)
> > > >
> > > > Well, one reason I didn't take this approach was that I didn't
> > > > happen to think of it. ;-)
> > > >
> > > > Also that I hadn't heard of wait_task_context_switch().
> > > >
> > > > Hmmm... Looking for wait_task_context_switch(). OK, found it.
> > > >
> > > > It looks to me that this primitive won't return until the
> > > > scheduler actually decides to run something else. We instead need
> > > > to have something that stops waiting once the CPU enters the
> > > > scheduler, hence the previous thought of making rcu_qsctr_inc()
> > > > do a bit of extra work.
> > > >
> > > > This would be a way of making an expedited RCU-sched across all
> > > > RCU implementations. As noted in the earlier email, it would not
> > > > handle RCU or RCU-bh in a -rt kernel.
> > > >
> > > > > By the time wait_task_context_switch() returns from the last
> > > > > CPU we know that the quiescent state has passed.
> > > >
> > > > We would want to wait for all of the CPUs in parallel, though,
> > > > wouldn't we? Seems that we would not want to wait for the last
> > > > CPU to do another trip through the scheduler if it had already
> > > > passed through the scheduler while we were waiting on the earlier
> > > > CPUs.
> > > >
> > > > So it seems like we would still want a two-pass approach -- one
> > > > pass to capture the current state, the second pass to wait for
> > > > the state to change.
> > >
> > > I think waiting in parallel is still possible (first kick all
> > > tasks, then make sure all tasks have left the CPU at least once).
> > >
> > > The busy-waiting in wait_task_context_switch() is indeed a problem
> > > - but perhaps that could be refactored to be a migration-thread
> > > driven wait_for_completion() + complete() cycle? It could be driven
> > > by preempt notifiers perhaps - and become zero-cost.
> >
> > Hmmm... It would need to be informed of the quiescent state even if
> > that quiescent state did not result in a preemption.
> >
> > But you are right -- I do need to expedite RCU, not just RCU-bh,
> > especially given that the boot-speed guys are starting to see grace
> > periods as a measurable fraction of the boot time.
> >
> > I will take another pass at this.
>
> It might sound a bit simplistic, but... scheduling a high-priority
> workqueue on every CPU would give you the guarantees you seem to
> need here. Or is the delay of letting the scheduler schedule a
> high-priority task a delay you are trying to avoid?
>
> Some kind of priority boosting done by synchronize_rcu() could
> probably work, and you could support rcu callbacks priority boosting
> by assigning a priority to each callback registered (same priority as
> the thread which invoked call_rcu). The rcu callbacks could then be
> sorted by priority in a RB tree, and only the callbacks associated
> with priority >= the next priority task would be executed.

I did something similar for the implementation of synchronize_sched()
in preemptable RCU. The interactions with CPU hotplug are a bit ugly.
It will be easier to hook into rcu_qsctr_inc(). ;-)

But this discussion has been quite useful -- my thoughts for the design
of the long-term solution were a bit lacking, as they would have allowed
a heavy callback load to delay an expedited grace period. So this will
be a bit of a hack until I get all the RCU implementations converged,
but there is a nice long-term solution to be had by integrating the
expediting into the hierarchical-RCU data structures.

Thanx, Paul
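
For readers following the thread, the two-pass approach mentioned above
(one pass to capture the current state, a second pass to wait for the
state to change) might be sketched in plain userspace C roughly as
follows. This is only an illustration, not code from the patch: the
struct and function names (qs_state, qs_capture(), qs_pending(),
qs_note()) are hypothetical, the per-CPU counter merely stands in for
whatever a hook in rcu_qsctr_inc() would update, and the memory barriers
and CPU-hotplug locking a real kernel implementation would need are
omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CPUS 64

/* Hypothetical per-CPU state; not a kernel data structure. */
struct qs_state {
	unsigned long qsctr[MAX_CPUS];	/* bumped on each quiescent state */
	bool online[MAX_CPUS];		/* false once a CPU has gone offline */
	size_t ncpus;
};

struct qs_snapshot {
	unsigned long qsctr[MAX_CPUS];
};

/* Stand-in for the extra work done when a CPU enters the scheduler. */
static void qs_note(struct qs_state *s, size_t cpu)
{
	assert(cpu < s->ncpus);
	s->qsctr[cpu]++;
}

/* Pass 1: record every CPU's counter before kicking the CPUs. */
static void qs_capture(const struct qs_state *s, struct qs_snapshot *snap)
{
	assert(s->ncpus <= MAX_CPUS);
	for (size_t cpu = 0; cpu < s->ncpus; cpu++)
		snap->qsctr[cpu] = s->qsctr[cpu];
}

/*
 * Pass 2 (one poll): count the CPUs that have not yet been seen in a
 * quiescent state since the snapshot.  A CPU whose counter has moved
 * at any point after pass 1 is done, no matter which CPU the waiter
 * happened to be checking at the time.  Offline CPUs count as done.
 */
static size_t qs_pending(const struct qs_state *s,
			 const struct qs_snapshot *snap)
{
	size_t pending = 0;

	for (size_t cpu = 0; cpu < s->ncpus; cpu++) {
		if (!s->online[cpu])
			continue;
		if (s->qsctr[cpu] == snap->qsctr[cpu])
			pending++;
	}
	return pending;
}
```

A waiter would call qs_capture(), kick all CPUs, and then poll
qs_pending() until it returns zero. Because every CPU is compared
against the same snapshot, the CPUs are effectively waited for in
parallel rather than one after another.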