RE: v3.18-RT
From: Carol Wong <hidden>
Date: 2016-08-19 00:49:01
Hi Sebastian,

Were you able to gain any insight from the traces?

If we were to proceed with reverting the kernel/sched/core.c patch in our build of 3.18.29-rt30, would the addition of the WARN_ON_ONCE(p->migrate_disable_atomic <= 0) debug check that you recommended (2016/07/29) be sufficient for detecting imbalances? We would perform extended testing on multiple systems to determine the effects of reverting the patch.

Cheers,
Carol

-----Original Message-----
From: Carol Wong
Sent: Wednesday, August 03, 2016 6:32 PM
To: 'Sebastian Andrzej Siewior'
Cc: linux-rt-users@vger.kernel.org; David Hauck; Preston Hauck
Subject: RE: v3.18-RT

Hi Sebastian,

I made the suggested change to sched/core.c and verified that CONFIG_SCHED_DEBUG=y. I reproduced the crash 3 times and captured the attached traces.

Thanks,
Carol

-----Original Message-----
From: Sebastian Andrzej Siewior [mailto:bigeasy@linutronix.de]
Sent: Friday, July 29, 2016 9:20 AM
To: Carol Wong
Cc: linux-rt-users@vger.kernel.org; David Hauck; Preston Hauck
Subject: Re: v3.18-RT

* Carol Wong | 2016-07-20 20:53:21 [+0000]:

> Hi Sebastian,

Hi Carol,

> We finally traced the boot-up crash to the following patch in kernel/sched/core.c:
> rt.git/commit/?h=v3.18-rt&id=62044e554f14547061afcfef7f0aceda43e28982
>
> After reverting the two-line patch in 3.18.29-rt30, the crash no longer occurs on our dual Xeon (2x12 core) system.
>
> Other observations:
> - Does not reproduce on single processor (2 and 4 core) systems
> - Reproduces under 3.18.27-rt27 and 3.18.36-rt38 on the dual Xeon
> - Does not reproduce on 3.18.27-rt26 and earlier on the dual Xeon
> - Reproduces more frequently on .29-rt30 (1 in 20 reboots) compared to .27-rt27 (1 in 100 reboots)
>
> So far we've not observed any side effects after reverting this patch.

This was part of CPU hotplug fixups. Lockdep might be broken without it, but I am not sure if that is the case most of the time or just during hotplug.

> I understand that a high core count system may not be easy to come by, so if there are diagnostics or patches you would like to try on the dual Xeon system, we can assist with that.

With that patch, migrate_disable() skips the whole preempt-lazy + pin-cpu code if called with IRQs off. Since interrupts are disabled, we can't migrate to another CPU, so it is a possible optimisation. It only makes a difference if migrate_disable() + migrate_enable() calls are not in balance. The commit
https://git.kernel.org/cgit/linux/kernel/git/rt/linux-stable-rt.git/commit/?h=v3.18-rt&id=8d51d3a296b6ec4aebd0d6d7e1b7162cd9bf6662
is one example where I fixed the imbalance.

Do you get additional backtraces with CONFIG_SCHED_DEBUG enabled? There is one thing the debug code does not cover, so could you please
add this chunk?

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 140ee06079b6..1f8613f77598 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3229,6 +3229,7 @@ void migrate_enable(void)
 	if (in_atomic() || irqs_disabled()) {
 #ifdef CONFIG_SCHED_DEBUG
+		WARN_ON_ONCE(p->migrate_disable_atomic <= 0);
 		p->migrate_disable_atomic--;
 #endif
 		return;

> Cheers,
> Carol Wong
> NetAcquire Corporation

Sebastian
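
For readers following the thread, the balance check discussed above can be modelled in userspace. This is only a minimal sketch under stated assumptions, not kernel code: struct task_model, its fields other than migrate_disable_atomic, and the model_* functions are hypothetical names made up for illustration, and the warn-once behaviour is emulated with a plain flag rather than the kernel's WARN_ON_ONCE().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical userspace stand-in for the per-task debug counter.
 * This is NOT the kernel's task_struct. */
struct task_model {
    int migrate_disable_atomic; /* atomic-context disables not yet undone */
    int imbalance_warnings;     /* number of times the check reported */
    bool warned;                /* emulates warn-once: report only once */
};

/* Models migrate_disable() taking the atomic / IRQs-off shortcut. */
static void model_migrate_disable_atomic(struct task_model *t)
{
    assert(t != NULL);
    t->migrate_disable_atomic++;
}

/* Models migrate_enable() taking the same shortcut, with the extra
 * check from the diff: complain once if no disable is outstanding. */
static void model_migrate_enable_atomic(struct task_model *t)
{
    assert(t != NULL);
    if (t->migrate_disable_atomic <= 0 && !t->warned) {
        t->warned = true;
        t->imbalance_warnings++;
        fprintf(stderr, "imbalance: enable without matching disable\n");
    }
    t->migrate_disable_atomic--;
}
```

In this model a balanced disable/enable pair leaves the counter at zero and reports nothing, while an extra enable reports once. A disable that is never followed by an enable leaves the counter positive and is not flagged by this particular check.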