Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver
From: Rich Felker <hidden>
Date: 2016-10-02 00:01:14
Also in:
linux-sh, lkml
On Sat, Oct 01, 2016 at 10:58:37AM -0700, Paul E. McKenney wrote:
On Sat, Oct 01, 2016 at 01:05:08PM -0400, Rich Felker wrote:quoted
On Fri, Sep 30, 2016 at 06:48:35AM -0700, Paul E. McKenney wrote:quoted
On Fri, Sep 30, 2016 at 03:15:11PM +0200, Thomas Gleixner wrote:quoted
On Tue, 27 Sep 2016, Rich Felker wrote:quoted
I've managed to get a trace with a stall. I'm not sure what the best way to share the full thing is, since it's large, but here are the potentially interesting parts.[ . . . ] Some RCU commentary, on the off-chance that it helps...quoted
So that should kick rcu_sched-7 in 10ms, latest 20ms from now and CPU1 goes into a NOHZ idle sleep.quoted
<idle>-0 [001] d... 109.953436: tick_stop: success=1 dependency=NONE <idle>-0 [001] d... 109.953617: hrtimer_cancel: hrtimer=109f449c <idle>-0 [001] d... 109.953818: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109880000000 softexpires=109880000000which is (using the 0.087621us delta between the trace clock and clock MONO) at: 109.880 + 0.087621 = 109.968 Which is about correct as we expect the RCU timer to fire at: 109.952633 + 0.01 = 109.963633 or latest at 109.952633 + 0.02 = 109.983633 There is another caveat. That nohz stuff can queue the rcu timer on CPU0, which it did not because:Just for annoying completeness, the location of the timer depends on how the rcuo callback-offload kthreads are constrained. And yes, in the most constrained case where all CPUs except for CPU 0 are nohz CPUs, they will by default all run on CPU 0.In default full nohz configuration, am I correct that all cpus except cpu0 willd be nohz and that the rcu callbacks then have to run on cpu0?In recent kernels, it looks to me that the default is that none of the CPUs will be nohz by default. You have to build with both NO_HZ_FULL and NO_HZ_FULL_ALL to get the everything-on-CPU-0 behavior.
Actually NO_HZ_FULL is not supported because HAVE_VIRT_CPU_ACCOUNTING_GEN and HAVE_CONTEXT_TRACKING are missing, so it's just normal NO_HZ. I listed the relevant config items somewhere earlier in this thread I think.
But are there rcu0 kthreads running on your system?
Apparently not:
# ps aux | grep rcu
7 root 0:00 [rcu_sched]
8 root 0:00 [rcu_bh]
395 root 0:00 grep rcu
quoted
quoted
quoted
quoted
rcu_sched-7 [001] d... 109.952633: timer_start: timer=160a9eb0 function=process_timeout expires=4294948284 [timeout=1] flags=0x00000001The CPU nr encoded in flags is: 1 Now we cancel and restart the timer w/o seing the interrupt expiring it. And that expiry should have happened at 109.968000 !?!quoted
<idle>-0 [001] d... 109.968225: hrtimer_cancel: hrtimer=109f449c <idle>-0 [001] d... 109.968526: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109890000000 softexpires=109890000000So this advances the next tick even further out. And CPU 0 sets the timer to the exact smae value:quoted
<idle>-0 [000] d.h. 109.969104: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=109890000000 softexpires=109890000000quoted
<idle>-0 [000] d.h. 109.977690: irq_handler_entry: irq=16 name=jcore_pit <idle>-0 [000] d.h. 109.977911: hrtimer_cancel: hrtimer=109e949c <idle>-0 [000] d.h. 109.978053: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109890434160Which expires here. And CPU1 instead of getting an interrupt and expiring the timer does the cancel/restart to the next jiffie again:quoted
<idle>-0 [001] d... 109.978206: hrtimer_cancel: hrtimer=109f449c <idle>-0 [001] d... 109.978495: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109900000000 softexpires=109900000000And this repeats;quoted
<idle>-0 [000] d.h. 109.987726: irq_handler_entry: irq=16 name=jcore_pit <idle>-0 [000] d.h. 109.987954: hrtimer_cancel: hrtimer=109e949c <idle>-0 [000] d.h. 109.988095: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109900474620quoted
<idle>-0 [001] d... 109.988243: hrtimer_cancel: hrtimer=109f449c <idle>-0 [001] d... 109.988537: hrtimer_start: hrtimer=109f449c fun9cThere is something badly wrong here.quoted
<idle>-0 [000] ..s. 110.019443: softirq_entry: vec=1 [action=TIMER] <idle>-0 [000] ..s. 110.019617: softirq_exit: vec=1 [action=TIMER] <idle>-0 [000] ..s. 110.019730: softirq_entry: vec=7 [action=SCHED] <idle>-0 [000] ..s. 110.020174: softirq_exit: vec=7 [action=SCHED] <idle>-0 [000] d.h. 110.027674: irq_handler_entry: irq=16 name=jcore_pit The rcu_sched process does not run again after the tick_stop until 132s, and only a few RCU softirqs happen (all shown above). During this time, cpu1 has no interrupt activity and nothing in the trace except the above hrtimer_cancel/hrtimer_start pairs (not sure how they're happening without any interrupts).If you drop out of the arch idle into the core idle loop then you might end up with this. You want to add a few trace points or trace_printks() to the involved functions. tick_nohz_restart() tick_nohz_stop_sched_tick() tick_nohz_restart_sched_tick() and the idle code should be a good starting point.quoted
This pattern repeats until almost 131s, where cpu1 goes into a frenzy of hrtimer_cancel/start:It's not a frenzy. It's the same pattern as above. It arms the timer to the next tick, but that timer never ever fires. And it does that every tick .... Please put a tracepoint into your set_next_event() callback as well. SO this changes here:quoted
<idle>-0 [001] d... 132.198170: hrtimer_cancel: hrtimer=109f449c <idle>-0 [001] d... 132.198451: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=132120000000 softexpires=132120000000quoted
<idle>-0 [001] dnh. 132.205860: irq_handler_entry: irq=20 name=ipi <idle>-0 [001] dnh. 132.206041: irq_handler_exit: irq=20 ret=handleSo CPU1 gets an IPIquoted
<idle>-0 [001] dn.. 132.206650: hrtimer_cancel: hrtimer=109f449c49c function=tick_sched_timer now=132119115200quoted
<idle>-0 [001] dn.. 132.206936: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=132120000000 softexpires=132120000000And rcu-sched-7 gets running magically, but we don't know what woke it up. Definitely not the timer, because that did not fire.quoted
rcu_sched-7 [001] d... 132.207710: timer_cancel: timer=160a9eb0It could have been an explicit wakeup at the end of a grace period. That would explain its cancelling the timer, anyway.I think the rcu stall handler kicked it, no? Looking at the code again, maybe that behavior needs to be explicitly turned on, so maybe it's just the uart interrupt activity/load from the stall message that breaks the stall condition.That is indeed another alternative. But either way, if your timers aren't waking rcu_sched up when it asks to be awakened, you will very likely get RCU CPU stall warning messages. And on small systems, rcu_sched asks to be awakened every few jiffies by default.
Can you clarify which timer you think should be waking it up and isn't? Maybe I'm mistaken but it looks to me like rcu_gp_kthread() is waking up as expected, only to go back to sleep due to failure of rcu_gp_fqs_check_wake to end the loop. To demonstrate this to myself, I hacked rcu_gp_fqs_check_wake to show 2 low bits of rnp->qsmask on LEDs. I also put a hook to update them in cpu_idle_loop just now, just in case. At least one, and usually two, LEDs are on when the stalls happen.
quoted
quoted
quoted
quoted
- During the whole sequence, hrtimer expiration times are being set to exact jiffies (@ 100 Hz), whereas before it they're quite arbitrary.When a CPU goes into NOHZ idle and the next (timer/hrtimer) is farther out than the next tick, then tick_sched_timer is set to this next event which can be far out. So that's expected.quoted
- The CLOCK_MONOTONIC hrtimer times do not match up with the timestamps; they're off by about 0.087s. I assume this is just sched_clock vs clocksource time and not a big deal.Yes. You can tell the tracer to use clock monotonic so then they should match.quoted
- The rcu_sched process is sleeping with timeout=1. This seems odd/excessive.Why is that odd? That's one tick, i.e. 10ms in your case. And that's not the problem at all. The problem is your timer not firing, but the cpu is obviously either getting out of idle and then moves the tick ahead for some unknown reason.And a one-jiffy timeout is in fact expected behavior when HZ=100. You have to be running HZ=250 or better to have two-jiffy timeouts, and HZ=500 or better for three-jiffy timeouts.One possible theory I'm looking at is that the two cpus are both waking up (leaving cpu_idle_poll or cpuidle_idle_call) every jiffy with sufficient consistency that every time the rcu_gp_fqs_check_wake loop wakes up in rcu_gp_kthread, the other cpu is in cpu_idle_loop but outside the rcu_idle_enter/rcu_idle_exit range. Would this block forward process? I added an LED indicator in rcu_gp_fqs_check_wake that shows the low 2 bits of rnp->qsmask every time it's called, and under normal operation the LEDs just flash on momentarily or just one stays on for a few seconds then goes off. During a stall both are stuck on. I'm still trying to make sense of the code but my impression so far is that, on a 2-cpu machine, this is a leaf node and the 2 bits correspond directly to cpus; is that right? If so I'm a bit confused because I don't see how forward progress could ever happen if the cpu on which rcu_gp_kthread is blocking forward progress of rcu_gp_kthread.No. If the CPUs are entering and leaving idle, and if your timers were waking up rcu_sched every few jiffies like it asks, then the repeated idle entry/exit events would be noticed, courtesy of the atomic increments of ->dynticks and the rcu_sched kthread's snapshotting and checking of this value.
I don't see how rcu_sched could notice the changes if it's stuck in the wait loop I think it's stuck in. There is no check of ->dynticks in rcu_gp_fqs_check_wake. Just in case checking *gfp & RCU_GP_FLAG_FQS accomplishes this, I updated my LED hacks to clear the LEDs in that exit path (and killed the other place that could turn them back on from cpu_idle_loop) but I still get 2 LEDs on for 21s followed by a stall message.
Even if the CPUs were always non-idle on every time force_quiescent_state() is invoked, give or take the possibility of counter wrap -- but even on a 32-bit system, that takes awhile.
Perhaps force_quiescent_state is not getting invoked? Does that sound plausible, and if so, how should I go about debugging it? I see it is called from the stall reporting code, so that's presumably what's breaking the stalls.
But if your timers don't wake up rcu_sched, then force_quiescent_state() will never be called, and the ->dynticks counter will be neither snapshotted nor checked, which will mean that the idle CPUs will never be noticed, which might mean that the grace period never completes. Which will definitely result in an RCU CPU stall warning like the one that you are seeing.
I don't follow. I thought force_quiescent_state was called by other code to break rcu_sched out of the loop I'm finding it stuck in, not by rcu_sched itself. Rich -- To unsubscribe from this list: send the line "unsubscribe devicetree" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html