RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? | linux-arm-kernel

RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Jonathan.Cameron@huawei.com (Jonathan Cameron) · 2017-07-25
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · npiggin@gmail.com (Nicholas Piggin) · 2017-07-25
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Paul E. McKenney <hidden> · 2017-07-25
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Jonathan.Cameron@huawei.com (Jonathan Cameron) · 2017-07-25
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Paul E. McKenney <hidden> · 2017-07-25
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Jonathan.Cameron@huawei.com (Jonathan Cameron) · 2017-07-25
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · davem@davemloft.net (David Miller) · 2017-07-25
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Paul E. McKenney <hidden> · 2017-07-26
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · davem@davemloft.net (David Miller) · 2017-07-26
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Paul E. McKenney <hidden> · 2017-07-26
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Jonathan.Cameron@huawei.com (Jonathan Cameron) · 2017-07-26
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Jonathan.Cameron@huawei.com (Jonathan Cameron) · 2017-07-26
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Jonathan.Cameron@huawei.com (Jonathan Cameron) · 2017-07-26
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Jonathan.Cameron@huawei.com (Jonathan Cameron) · 2017-07-26
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Paul E. McKenney <hidden> · 2017-07-26
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Jonathan.Cameron@huawei.com (Jonathan Cameron) · 2017-07-26
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Jonathan.Cameron@huawei.com (Jonathan Cameron) · 2017-07-26
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Paul E. McKenney <hidden> · 2017-07-26
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · davem@davemloft.net (David Miller) · 2017-07-26
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Jonathan.Cameron@huawei.com (Jonathan Cameron) · 2017-07-26
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Jonathan.Cameron@huawei.com (Jonathan Cameron) · 2017-07-27
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Paul E. McKenney <hidden> · 2017-07-26
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Paul E. McKenney <hidden> · 2017-07-26
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · davem@davemloft.net (David Miller) · 2017-07-26
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Paul E. McKenney <hidden> · 2017-07-26
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · davem@davemloft.net (David Miller) · 2017-07-26
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Paul E. McKenney <hidden> · 2017-07-27
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · npiggin@gmail.com (Nicholas Piggin) · 2017-07-27
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Paul E. McKenney <hidden> · 2017-07-27
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Jonathan.Cameron@huawei.com (Jonathan Cameron) · 2017-07-27
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Jonathan.Cameron@huawei.com (Jonathan Cameron) · 2017-07-27
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Paul E. McKenney <hidden> · 2017-07-27
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Jonathan.Cameron@huawei.com (Jonathan Cameron) · 2017-07-28
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Boqun Feng <hidden> · 2017-07-28
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Jonathan.Cameron@huawei.com (Jonathan Cameron) · 2017-07-28
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Paul E. McKenney <hidden> · 2017-07-28
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Paul E. McKenney <hidden> · 2017-07-28
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Paul E. McKenney <hidden> · 2017-07-28
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Boqun Feng <hidden> · 2017-07-30
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Paul E. McKenney <hidden> · 2017-07-30
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Boqun Feng <hidden> · 2017-07-29
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · davem@davemloft.net (David Miller) · 2017-07-28
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Jonathan.Cameron@huawei.com (Jonathan Cameron) · 2017-07-28
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Paul E. McKenney <hidden> · 2017-07-28
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Jonathan.Cameron@huawei.com (Jonathan Cameron) · 2017-07-28
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Paul E. McKenney <hidden> · 2017-07-28
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Jonathan.Cameron@huawei.com (Jonathan Cameron) · 2017-07-31
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Paul E. McKenney <hidden> · 2017-07-31
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Jonathan.Cameron@huawei.com (Jonathan Cameron) · 2017-07-31
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Paul E. McKenney <hidden> · 2017-08-01
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Jonathan.Cameron@huawei.com (Jonathan Cameron) · 2017-08-02
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Paul E. McKenney <hidden> · 2017-08-15
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Jonathan.Cameron@huawei.com (Jonathan Cameron) · 2017-08-16
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · mpe@ellerman.id.au (Michael Ellerman) · 2017-08-16
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Paul E. McKenney <hidden> · 2017-08-16
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · npiggin@gmail.com (Nicholas Piggin) · 2017-08-16
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Paul E. McKenney <hidden> · 2017-08-16
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · mpe@ellerman.id.au (Michael Ellerman) · 2017-08-17
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · npiggin@gmail.com (Nicholas Piggin) · 2017-08-20
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · davem@davemloft.net (David Miller) · 2017-08-20
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Paul E. McKenney <hidden> · 2017-08-20
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · npiggin@gmail.com (Nicholas Piggin) · 2017-08-20
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Paul E. McKenney <hidden> · 2017-08-20
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Paul E. McKenney <hidden> · 2017-08-20
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · npiggin@gmail.com (Nicholas Piggin) · 2017-08-21
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · npiggin@gmail.com (Nicholas Piggin) · 2017-08-21
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Jonathan.Cameron@huawei.com (Jonathan Cameron) · 2017-08-21
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · npiggin@gmail.com (Nicholas Piggin) · 2017-08-21
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Jonathan.Cameron@huawei.com (Jonathan Cameron) · 2017-08-21
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · davem@davemloft.net (David Miller) · 2017-08-21
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Jonathan.Cameron@huawei.com (Jonathan Cameron) · 2017-08-22
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Abdul Haleem <hidden> · 2017-08-22
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Paul E. McKenney <hidden> · 2017-08-22
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Paul E. McKenney <hidden> · 2017-09-06
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Paul E. McKenney <hidden> · 2017-08-22
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Jonathan.Cameron@huawei.com (Jonathan Cameron) · 2017-07-31
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Jonathan.Cameron@huawei.com (Jonathan Cameron) · 2017-07-31
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Jonathan.Cameron@huawei.com (Jonathan Cameron) · 2017-08-01
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · davem@davemloft.net (David Miller) · 2017-07-26
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Paul E. McKenney <hidden> · 2017-07-26
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this? · Jonathan.Cameron@huawei.com (Jonathan Cameron) · 2017-07-26

RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

From: Paul E. McKenney <hidden>
Date: 2017-08-01 18:46:46
Also in: linuxppc-dev, sparclinux

On Mon, Jul 31, 2017 at 04:27:57PM +0100, Jonathan Cameron wrote:

On Mon, 31 Jul 2017 08:04:11 -0700
"Paul E. McKenney" [off-list ref] wrote:

quoted

On Mon, Jul 31, 2017 at 12:08:47PM +0100, Jonathan Cameron wrote:

quoted

On Fri, 28 Jul 2017 12:03:50 -0700
"Paul E. McKenney" [off-list ref] wrote:

quoted

On Fri, Jul 28, 2017 at 06:27:05PM +0100, Jonathan Cameron wrote:

quoted

On Fri, 28 Jul 2017 09:55:29 -0700
"Paul E. McKenney" [off-list ref] wrote:

quoted

On Fri, Jul 28, 2017 at 02:24:03PM +0100, Jonathan Cameron wrote:

quoted

On Fri, 28 Jul 2017 08:44:11 +0100
Jonathan Cameron [off-list ref] wrote:

[ . . . ]

quoted

Ok.  Some info.  I disabled a few driver (usb and SAS) in the interest of having
fewer timer events.  Issue became much easier to trigger (on some runs before
I could get tracing up and running)
e
So logs are large enough that pastebin doesn't like them - please shoet if

quoted

e another timer period is of interest.

https://pastebin.com/iUZDfQGM for the timer trace.
https://pastebin.com/3w1F7amH for dmesg.  

The relevant timeout on the RCU stall detector was 8 seconds.  Event is
detected around 835.

It's a lot of logs, so I haven't identified a smoking gun yet but there
may well be one in there.

The dmesg says:

rcu_preempt kthread starved for 2508 jiffies! g112 c111 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1

So I look for "rcu_preempt" timer events and find these:

rcu_preempt-9     [019] ....   827.579114: timer_init: timer=ffff8017d5fc7da0
rcu_preempt-9     [019] d..1   827.579115: timer_start: timer=ffff8017d5fc7da0 function=process_timeout 

Next look for "ffff8017d5fc7da0" and I don't find anything else.

It does show up off the bottom of what would fit in pastebin...

     rcu_preempt-9     [001] d..1   837.681077: timer_cancel: timer=ffff8017d5fc7da0
     rcu_preempt-9     [001] ....   837.681086: timer_init: timer=ffff8017d5fc7da0
     rcu_preempt-9     [001] d..1   837.681087: timer_start: timer=ffff8017d5fc7da0 function=process_timeout expires=4295101298 [timeout=1] cpu=1 idx=0 flags=

Odd.  I would expect an expiration...  And ten seconds is way longer
than the requested one jiffy!

quoted

The timeout was one jiffy, and more than a second later, no expiration.
Is it possible that this event was lost?  I am not seeing any sign of
this is the trace.

I don't see any sign of CPU hotplug (and I test with lots of that in
any case).

The last time we saw something like this it was a timer HW/driver problem,
but it is a bit hard to imagine such a problem affecting both ARM64
and SPARC.  ;-)

Could be different issues, both of which were hidden by that lockup detector.

There is an errata work around for the timers on this particular board.
I'm only vaguely aware of it, so may be unconnected.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/clocksource/arm_arch_timer.c?h=v4.13-rc2&id=bb42ca47401010fc02901b5e8f79e40a26f208cb

Seems unlikely though! + we've not yet seen it on the other chips that
errata effects (not that that means much).

If you can reproduce quickly, might be worth trying anyway...

							Thanx, Paul

Errata fix is running already and was for all those tests.

I was afraid of that...  ;-)

It's a pretty rare errata it seems.  Not actually managed to catch
one yet.

quoted

I'll have a dig into the timers today and see where I get to.

Look forward to seeing what you find!

Nothing obvious turning up other than we don't seem to have issue
when we aren't running hrtimers.

On a plus side I just got a report that it is effecting our d03
boards which is good on the basis I couldn't tell what the difference
could be wrt to this issue!

It indeed looks like we are consistently missing a timer before
the rcu splat occurs.

And for my part, my tests with CONFIG_HZ_PERIODIC=y and
CONFIG_RCU_FAST_NO_HZ=n showed roughly the same failure rate
as other runs.

Missing a timer can most certainly give RCU severe heartburn!  ;-)
Do you have what you need to track down the missing timer?  

							Thanx, Paul

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help