Thread (81 messages), 7 authors, 2017-09-06

RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

From: npiggin@gmail.com (Nicholas Piggin)
Date: 2017-08-21 00:52:58
Also in: linuxppc-dev, sparclinux

On Sun, 20 Aug 2017 14:14:29 -0700
"Paul E. McKenney" [off-list ref] wrote:
On Sun, Aug 20, 2017 at 11:35:14AM -0700, Paul E. McKenney wrote:
quoted
On Sun, Aug 20, 2017 at 11:00:40PM +1000, Nicholas Piggin wrote:  
quoted
On Sun, 20 Aug 2017 14:45:53 +1000
Nicholas Piggin [off-list ref] wrote:
  
quoted
On Wed, 16 Aug 2017 09:27:31 -0700
"Paul E. McKenney" [off-list ref] wrote:  
quoted
On Wed, Aug 16, 2017 at 05:56:17AM -0700, Paul E. McKenney wrote:

Thomas, John, am I misinterpreting the timer trace event messages?    
So I did some digging, and what you find is that rcu_sched seems to do a
simple schedule_timeout(1) and just goes out to lunch for many seconds.
The process_timeout timer never fires (when it finally does wake after
one of these events, it usually removes the timer with del_timer_sync).

So this patch seems to fix it. Testing, comments welcome.  
Okay this had a problem of trying to forward the timer from a timer
callback function.

This was my other approach which also fixes the RCU warnings, but it's
a little more complex. I reworked it a bit so the mod_timer fast path
hopefully doesn't have much more overhead (actually by reading jiffies
only when needed, it probably saves a load).  
Giving this one a whirl!  
No joy here, but then again there are other reasons to believe that I
am seeing a different bug than Dave and Jonathan are.

OK, not -entirely- without joy -- 10 of 14 runs were error-free, which
is a good improvement over 0 of 84 for your earlier patch.  ;-)  But
not statistically different from what I see without either patch.

But no statistical difference compared to without patch, and I still
see the "rcu_sched kthread starved" messages.  For whatever it is worth,
by the way, I also see this: "hrtimer: interrupt took 5712368 ns".
Hmmm...  I am also seeing that without any of your patches.  Might
be hypervisor preemption, I guess.
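
For reference, the two pass rates quoted above (10 of 14 with the newer patch, 0 of 84 with the earlier one) can be compared with a one-sided Fisher exact test. This is only a rough sketch: the function name and the choice of test are assumptions of this note, not something used in the thread, and the baseline counts without either patch are not given here, so that comparison is not attempted.

```python
from math import comb

def fisher_one_sided(pass_a, runs_a, pass_b, runs_b):
    """Hypothetical helper: one-sided Fisher exact test.

    Returns the probability that group A would show pass_a or more
    passing runs if both groups shared the same underlying pass rate
    (hypergeometric null with the row and column totals held fixed).
    """
    total_runs = runs_a + runs_b
    total_pass = pass_a + pass_b
    total_fail = total_runs - total_pass
    # Number of ways to pick which runs belong to group A.
    denom = comb(total_runs, runs_a)
    # Sum the tail: k passes in group A, for k at least as large as observed.
    upper = min(runs_a, total_pass)
    tail = sum(comb(total_pass, k) * comb(total_fail, runs_a - k)
               for k in range(pass_a, upper + 1))
    return tail / denom

# Figures quoted in the message: 10 of 14 clean runs versus 0 of 84.
p = fisher_one_sided(10, 14, 0, 84)
print(f"one-sided p-value: {p:.3g}")
```

A very small p-value here only says the two patched configurations differ from each other. It says nothing about either one relative to an unpatched kernel, which is the comparison reported as showing no difference.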
Okay it makes the warnings go away for me, but I'm just booting then
leaving the system idle. You're doing some CPU hotplug activity?

Thanks,
Nick