RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?
From: Jonathan.Cameron@huawei.com (Jonathan Cameron)
Date: 2017-07-26 15:33:40
Also in: linuxppc-dev, sparclinux
On Wed, 26 Jul 2017 15:23:15 +0100 Jonathan Cameron [off-list ref] wrote:
> On Wed, 26 Jul 2017 07:14:17 -0700 "Paul E. McKenney" [off-list ref] wrote:
> > On Wed, Jul 26, 2017 at 01:28:01PM +0100, Jonathan Cameron wrote:
> > > On Wed, 26 Jul 2017 10:32:32 +0100 Jonathan Cameron [off-list ref] wrote:
> > > > On Wed, 26 Jul 2017 09:16:23 +0100 Jonathan Cameron [off-list ref] wrote:
> > > > > On Tue, 25 Jul 2017 21:12:17 -0700 "Paul E. McKenney" [off-list ref] wrote:
> > > > > > On Tue, Jul 25, 2017 at 09:02:33PM -0700, David Miller wrote:
> > > > > > > From: "Paul E. McKenney" <redacted> Date: Tue, 25 Jul 2017 20:55:45 -0700
> > > > > > > > On Tue, Jul 25, 2017 at 02:10:29PM -0700, David Miller wrote:
> > > > > > > > > Just to report, turning softlockup back on fixes things for me on sparc64 too.
> > > > > > > >
> > > > > > > > Very good!
> > > > > > > >
> > > > > > > > > The thing about softlockup is it runs an hrtimer, which seems to run about every 4 seconds.
> > > > > > > >
> > > > > > > > I could see where that could shake things loose, but I am surprised that it would be needed. I ran a short run with CONFIG_SOFTLOCKUP_DETECTOR=y with no trouble, but I will be running a longer test later on.
> > > > > > > >
> > > > > > > > > So I wonder if this is a NO_HZ problem.
> > > > > > > >
> > > > > > > > Might be. My tests run with NO_HZ_FULL=n and NO_HZ_IDLE=y. What are you running? (Again, my symptoms are slightly different, so I might be seeing a different bug.)
> > > > > > >
> > > > > > > I run with NO_HZ_FULL=n and NO_HZ_IDLE=y, just like you. To clarify, the symptoms show up with SOFTLOCKUP_DETECTOR disabled.
> > > > > >
> > > > > > Same here -- but my failure case happens fairly rarely, so it will take some time to gain reasonable confidence that enabling SOFTLOCKUP_DETECTOR had effect. But you are right, might be interesting to try NO_HZ_PERIODIC=y or NO_HZ_FULL=y. So many possible tests, and so little time. ;-)
> > > > > >
> > > > > > Thanx, Paul
> > > > >
> > > > > I'll be the headless chicken running around and trying as many tests as I can fit in. Typical time to see the failure for us is sub 10 minutes so we'll see how far we get. Make me a list to run if you like ;)
> > > > >
> > > > > NO_HZ_PERIODIC=y running now.
> > > >
> > > > By which I mean CONFIG_HZ_PERIODIC=y
> >
> > I did get that messed up, didn't I? Sorry for my confusion!
> >
> > > > Anyhow, run for 40 minutes without seeing a splat but my sanity check on the NO_HZ_FULL=n and NO_HZ_IDLE=y this morning took 20 minutes so I won't have much confidence until we are a few hours in on this. Anyhow, certainly looking like a promising direction for investigation!
> > >
> > > Well it's done over 3 hours without a splat so I think it is fine with CONFIG_HZ_PERIODIC=y
> >
> > Thank you!
> >
> > If you run with SOFTLOCKUP_DETECTOR=n and NO_HZ_IDLE=y, but have a normal user task waking up every few seconds on each CPU, does the problem occur? (The question is whether any disturbance gets things going, or whether there is something special about SOFTLOCKUP_DETECTOR=y and HZ_PERIODIC=y.)
> >
> > Dave, any other ideas on what might be causing this or what might be tested?
> >
> > Thanx, Paul
>
> Although it's still early days (40 mins in) it looks like the issue first occurred between 4.10-rc7 and 4.11-rc1 (don't ask why those particular RCs). Bad as with current kernel on 4.11-rc1 and good on 4.10-rc7.

Didn't leave it long enough. It is still bad on 4.10-rc7; it just took over an hour to occur.
Could be something different was hiding it in 4.10 though. We have a fair delta from mainline back then unfortunately so bisecting will be 'interesting'.

I'll see if I can get the test you suggest running.

Jonathan
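
For anyone wanting to try the test Paul suggests (SOFTLOCKUP_DETECTOR=n and NO_HZ_IDLE=y, with a normal user task waking every few seconds on each CPU), here is a minimal sketch of such a waker. It is not from the thread and not anyone's actual test harness. The names `waker` and `run_wakers` are hypothetical, the 2 second default is only a guess at "every few seconds", and Python threads stand in for the user tasks. It is Linux-only, since it relies on `os.sched_setaffinity`.

```python
import os
import threading
import time


def waker(cpu, interval_s, wakeups, counts):
    """Pin the calling thread to one CPU, then sleep and wake repeatedly."""
    # With pid 0, Linux applies the affinity mask to the calling thread only.
    os.sched_setaffinity(0, {cpu})
    done = 0
    while wakeups is None or done < wakeups:
        # Blocking here lets the CPU go idle between wakeups.
        time.sleep(interval_s)
        done += 1
        counts[cpu] = done


def run_wakers(interval_s=2.0, wakeups=None):
    """Run one pinned waker per CPU this process may use.

    interval_s is an assumed value for "every few seconds".
    wakeups=None runs until interrupted; an integer stops after that many.
    Returns a dict mapping each CPU to the number of wakeups completed.
    """
    cpus = sorted(os.sched_getaffinity(0))
    counts = {cpu: 0 for cpu in cpus}
    threads = [
        threading.Thread(target=waker, args=(cpu, interval_s, wakeups, counts))
        for cpu in cpus
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts
```

Calling `run_wakers()` with no arguments keeps every allowed CPU waking about every 2 seconds until interrupted. Whether that is enough disturbance to hide the stall is exactly the open question, so treat a clean run as a data point rather than a conclusion.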