Thread: 49 messages, 5 authors, 2017-08-01

Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Date: 2017-07-26 12:49:43
Also in: linux-arm-kernel, sparclinux


On Wed, 26 Jul 2017 13:28:01 +0100
Jonathan Cameron [off-list ref] wrote:
On Wed, 26 Jul 2017 10:32:32 +0100
Jonathan Cameron [off-list ref] wrote:
quoted
On Wed, 26 Jul 2017 09:16:23 +0100
Jonathan Cameron [off-list ref] wrote:
  
quoted
On Tue, 25 Jul 2017 21:12:17 -0700
"Paul E. McKenney" [off-list ref] wrote:
    
quoted
On Tue, Jul 25, 2017 at 09:02:33PM -0700, David Miller wrote:      
quoted
From: "Paul E. McKenney" <redacted>
Date: Tue, 25 Jul 2017 20:55:45 -0700
        
quoted
On Tue, Jul 25, 2017 at 02:10:29PM -0700, David Miller wrote:        
quoted
Just to report, turning softlockup back on fixes things for me on
sparc64 too.        
Very good!
        
quoted
The thing about softlockup is it runs an hrtimer, which seems to run
about every 4 seconds.        
I could see where that could shake things loose, but I am surprised that
it would be needed.  I ran a short run with CONFIG_SOFTLOCKUP_DETECTOR=y
with no trouble, but I will be running a longer test later on.
        
quoted
So I wonder if this is a NO_HZ problem.        
Might be.  My tests run with NO_HZ_FULL=n and NO_HZ_IDLE=y.  What are
you running?  (Again, my symptoms are slightly different, so I might
be seeing a different bug.)        
I run with NO_HZ_FULL=n and NO_HZ_IDLE=y, just like you.

To clarify, the symptoms show up with SOFTLOCKUP_DETECTOR disabled.        
Same here -- but my failure case happens fairly rarely, so it will take
some time to gain reasonable confidence that enabling SOFTLOCKUP_DETECTOR
had an effect.

But you are right, might be interesting to try NO_HZ_PERIODIC=y
or NO_HZ_FULL=y.  So many possible tests, and so little time.  ;-)

							Thanx, Paul
      
I'll be the headless chicken running around and trying as many tests
as I can fit in.  Typical time to see the failure for us is sub 10
minutes so we'll see how far we get.

Make me a list to run if you like ;)

NO_HZ_PERIODIC=y running now.    
By which I mean CONFIG_HZ_PERIODIC=y

Anyhow, it has run for 40 minutes without a splat, but my sanity check
on NO_HZ_FULL=n and NO_HZ_IDLE=y this morning took 20 minutes, so
I won't have much confidence until we are a few hours in on this.

Anyhow, certainly looking like a promising direction for investigation!
  
Well, it has run for over 3 hours without a splat, so I think it is fine
with CONFIG_HZ_PERIODIC=y.
As I think we expected, the problem occurs with NO_HZ_FULL.
Happened pretty quickly but given the somewhat random nature,
might just be coincidence.

Jonathan
quoted
Jonathan
  
quoted
Jonathan
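
For anyone repeating these runs, it may help to record which tick mode and
softlockup setting a given kernel was actually built with. The following is
only a rough sketch, not something from the thread: it assumes the usual
.config syntax ("CONFIG_FOO=y" and "# CONFIG_FOO is not set"), and the
helper names parse_kconfig and summarize are made up for illustration.

```python
import re

# Tick-mode options discussed in the thread, with their CONFIG_ prefix.
TICK_OPTIONS = ("CONFIG_HZ_PERIODIC", "CONFIG_NO_HZ_IDLE", "CONFIG_NO_HZ_FULL")


def parse_kconfig(text):
    """Hypothetical helper: map each CONFIG_ option to its value.

    Lines of the form '# CONFIG_FOO is not set' are recorded as 'n'.
    """
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"^(CONFIG_\w+)=(.*)$", line)
        if m:
            options[m.group(1)] = m.group(2)
            continue
        m = re.match(r"^# (CONFIG_\w+) is not set$", line)
        if m:
            options[m.group(1)] = "n"
    return options


def summarize(text):
    """Hypothetical helper: report the tick mode and softlockup setting."""
    opts = parse_kconfig(text)
    enabled = [name for name in TICK_OPTIONS if opts.get(name) == "y"]
    return {
        # Exactly one tick option is expected; anything else gives None.
        "tick_mode": enabled[0] if len(enabled) == 1 else None,
        "softlockup_detector": opts.get("CONFIG_SOFTLOCKUP_DETECTOR", "n") == "y",
    }
```

Feeding summarize() the contents of each build's .config and logging the
result next to the time-to-splat keeps a run from being attributed to the
wrong configuration.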

_______________________________________________
linuxarm mailing list
linuxarm@huawei.com
http://rnd-openeuler.huawei.com/mailman/listinfo/linuxarm    
