Thread: 81 messages, 7 authors, 2017-09-06

RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

From: Jonathan.Cameron@huawei.com (Jonathan Cameron)
Date: 2017-07-28 13:24:03
Also in: linuxppc-dev, sparclinux

On Fri, 28 Jul 2017 08:44:11 +0100
Jonathan Cameron [off-list ref] wrote:
On Thu, 27 Jul 2017 09:52:45 -0700
"Paul E. McKenney" [off-list ref] wrote:
quoted
On Thu, Jul 27, 2017 at 05:39:23PM +0100, Jonathan Cameron wrote:  
quoted
On Thu, 27 Jul 2017 14:49:03 +0100
Jonathan Cameron [off-list ref] wrote:
    
quoted
On Thu, 27 Jul 2017 05:49:13 -0700
"Paul E. McKenney" [off-list ref] wrote:
    
quoted
On Thu, Jul 27, 2017 at 02:34:00PM +1000, Nicholas Piggin wrote:      
quoted
On Wed, 26 Jul 2017 18:42:14 -0700
"Paul E. McKenney" [off-list ref] wrote:
        
quoted
On Wed, Jul 26, 2017 at 04:22:00PM -0700, David Miller wrote:        
        
quoted
quoted
Indeed, that really wouldn't explain how we end up with a RCU stall
dump listing almost all of the cpus as having missed a grace period.          
I have seen stranger things, but admittedly not often.        
So the backtraces show the RCU gp thread in schedule_timeout.

Are you sure that its timeout has expired and it's not being scheduled,
or could it be a bad (large) timeout (looks unlikely) or that it's being
scheduled but not correctly noting gps on other CPUs?

It's not in R state, so if it's not being scheduled at all, then it's
because the timer has not fired:        
Good point, Nick!

Jonathan, could you please reproduce collecting timer event tracing?      
I'm a little new to tracing (only started playing with it last week)
so fingers crossed I've set it up right.  No splats yet.  Was getting
splats on reading out the trace when running with the RCU stall timer
set to 4 so have increased that back to the default and am rerunning.

This may take a while.  Correct me if I've gotten this wrong, to save time:

echo "timer:*" > /sys/kernel/debug/tracing/set_event

Then, when it dumps, do I just send you the relevant part of what is in
/sys/kernel/debug/tracing/trace?
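
For reference, the setup described above could be wrapped up roughly as follows. This is only a sketch: the `TRACEFS` variable and the two function names are made up for illustration, and it assumes root access with debugfs mounted at the path used in this thread. `set_event`, `tracing_on` and `trace` are the standard ftrace control files.

```shell
#!/bin/sh
# Sketch only: TRACEFS and the function names are illustrative assumptions.
# The default is the path used in this thread; writing there needs root.
TRACEFS="${TRACEFS:-/sys/kernel/debug/tracing}"

# Enable every event in the "timer" subsystem and make sure tracing is on.
enable_timer_events() {
    echo "timer:*" > "$TRACEFS/set_event" || return 1
    echo 1 > "$TRACEFS/tracing_on"
}

# Copy the current contents of the trace buffer to the file given as $1.
save_trace() {
    cat "$TRACEFS/trace" > "$1"
}
```

Calling `enable_timer_events` before the run and `save_trace trace.txt` after the stall is reported would leave a file that can be trimmed before sending.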
Interestingly the only thing that can make it trip for me with tracing on
is peeking into the tracing buffers.  Not sure whether this is a valid case
or not.

Anyhow all timer activity seems to stop around the area of interest.
Firstly sorry to those who got the rather silly length email a minute ago.
It bounced on the list (fair enough - I was just being lazy on getting
data past our firewalls).

Ok.  Some info.  I disabled a few drivers (usb and SAS) in the interest of having
fewer timer events.  Issue became much easier to trigger (on some runs before
I could get tracing up and running).

So logs are large enough that pastebin doesn't like them - please shout if
another timer period is of interest.

https://pastebin.com/iUZDfQGM for the timer trace.
https://pastebin.com/3w1F7amH for dmesg.  

The relevant timeout on the RCU stall detector was 8 seconds.  Event is
detected around 835.
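
Given those two numbers, one way to cut the trace down is to keep only the events in the window leading up to detection. The sketch below is an assumption-laden helper, not something from the thread: `trace_window` is a made-up name, it assumes the default ftrace text layout where the timestamp is the first field of the form `<secs>.<usecs>:`, and it assumes the "835" above is in the same seconds clock as the trace timestamps. The window itself (roughly 827 to 835) is inferred from the 8 second timeout.

```shell
# Print only the trace lines whose timestamp lies in [start, end] seconds.
# Usage: trace_window START END FILE
trace_window() {
    awk -v start="$1" -v end="$2" '
        /^#/ { next }    # skip the header comment lines
        {
            # The timestamp column can shift with the task name, so scan
            # for the first field that looks like "123.456789:".
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^[0-9]+\.[0-9]+:$/) {
                    ts = $i
                    sub(/:$/, "", ts)
                    if (ts + 0 >= start + 0 && ts + 0 <= end + 0) print
                    break
                }
            }
        }
    ' "$3"
}
```

For example, `trace_window 827 835 trace.txt` would print just the timer events from the 8 seconds before the stall was reported, which should be small enough to paste.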

It's a lot of logs, so I haven't identified a smoking gun yet but there
may well be one in there.

Jonathan