RE: v3.18-RT

From: Carol Wong <hidden>
Date: 2016-09-20 18:27:19

Hi Sebastian,

You wrote:

One thing on the bisect. The git tree has the patches in this order:
 (1) kernel: migrate_disable() do fastpath in atomic & irqs-off
 (2) kernel: softirq: unlock with irqs on

but you need to apply patch #2 before #1. So if you bisect and you hit
warnings due to #1, please note that you need to apply #2.

T01 and T02 probably show the same issue, but there are too many
warnings coming in parallel. If this comes from the sched patch due to
the #1/#2 mix-up, then don't bisect here, or have them both applied.
The call path itself does look special as it would violate the rule
of atomic locking / unlocking (as it was fixed in #2 for instance).
At this point I assume that your bisect went wrong due to patch
#1/#2.

The traces were produced using the original 3.18.29-rt30 kernel (with all patches) plus the addition of 
WARN_ON_ONCE(p->migrate_disable_atomic <= 0) in migrate_enable() and CONFIG_SCHED_DEBUG=y.
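
As a rough illustration only (this is not the kernel code), the effect of such a check can be modelled in user space. The struct and the model_* function names below are hypothetical stand-ins; only the condition p->migrate_disable_atomic <= 0 comes from the check described above, and the one-shot flag merely imitates the "once" behaviour of WARN_ON_ONCE().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical user-space stand-in for the per-task counter; not kernel code. */
struct task_model {
    int migrate_disable_atomic; /* number of outstanding "disable" calls */
};

/* Imitates the one-shot behaviour of WARN_ON_ONCE(): report only the first hit. */
static bool warned_once;

/* Model of a disable call: record one more outstanding disable. */
static void model_migrate_disable(struct task_model *p)
{
    assert(p != NULL);
    p->migrate_disable_atomic++;
}

/*
 * Model of an enable call. The check runs before the decrement, so it
 * fires when an enable arrives without a matching earlier disable.
 * Returns true if the imbalance condition was met on this call.
 */
static bool model_migrate_enable(struct task_model *p)
{
    assert(p != NULL);

    bool imbalance = (p->migrate_disable_atomic <= 0);

    if (imbalance && !warned_once) {
        warned_once = true;
        fprintf(stderr, "WARNING: unbalanced enable, counter=%d\n",
                p->migrate_disable_atomic);
    }

    p->migrate_disable_atomic--;
    return imbalance;
}
```

In this model a balanced disable/enable pair leaves the counter at zero without a warning, while an extra enable trips the check. Whether the real check catches every imbalance depends on the actual kernel code paths, which this sketch does not reproduce.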

When I revert only patch #1 from the 3.18.29-rt30 kernel, the kernel never crashes. I have been running long-duration tests on a dual Xeon system and a quad-core i7 system with patch #1 reverted.

Cheers,
Carol

-----Original Message-----
From: Sebastian Andrzej Siewior [mailto:bigeasy@linutronix.de]
Sent: Thursday, September 08, 2016 6:45 AM
To: Carol Wong
Cc: linux-rt-users@vger.kernel.org; David Hauck; Preston Hauck
Subject: Re: v3.18-RT

On 2016-08-19 00:41:46 [+0000], Carol Wong wrote:

quoted

Hi Sebastian,

Hi Carol,

quoted

Were you able to gain any insight from the traces?

not really. T00 shows a fault in
[    2.756284] BUG: unable to handle kernel NULL pointer dereference at 00000004
[    2.756289] IP: [<c11653e7>] kmem_cache_alloc+0x87/0x230
from ida_pre_get() / create_worker(). That is quite late so I have no
idea why that would happen.
The other two are not really helpful.

quoted

If we were to proceed with reverting the kernel/sched/core.c patch in our
build of 3.18.29-rt30, would the addition of the
WARN_ON_ONCE(p->migrate_disable_atomic <= 0) debug check that you
recommended (2016/07/29) be sufficient for detecting imbalances? We
would perform extended testing on multiple systems to determine the
effects of reverting the patch.

One thing on the bisect. The git tree has the patches in this order:
 (1) kernel: migrate_disable() do fastpath in atomic & irqs-off
 (2) kernel: softirq: unlock with irqs on

but you need to apply patch #2 before #1. So if you bisect and you hit
warnings due to #1, please note that you need to apply #2.

T01 and T02 probably show the same issue, but there are too many
warnings coming in parallel. If this comes from the sched patch due to
the #1/#2 mix-up, then don't bisect here, or have them both applied.
The call path itself does look special as it would violate the rule
of atomic locking / unlocking (as it was fixed in #2 for instance).
At this point I assume that your bisect went wrong due to patch
#1/#2.

quoted

Cheers,
Carol

Sebastian
