Thread: 24 messages, 5 authors, 2020-11-25

Re: linux-next: stall warnings and deadlock on Arm64 (was: [PATCH] kfence: Avoid stalling...)

From: Mark Rutland <mark.rutland@arm.com>
Date: 2020-11-20 18:02:22
Also in: linux-mm, lkml, rcu

On Fri, Nov 20, 2020 at 09:38:24AM -0800, Paul E. McKenney wrote:
On Fri, Nov 20, 2020 at 03:22:00PM +0000, Mark Rutland wrote:
On Fri, Nov 20, 2020 at 06:39:28AM -0800, Paul E. McKenney wrote:
On Fri, Nov 20, 2020 at 03:19:28PM +0100, Marco Elver wrote:
I found that disabling ftrace for some of kernel/rcu (see below) solved
the stalls (and any mention of deadlocks as a side-effect I assume),
resulting in successful boot.

Does that provide any additional clues? I tried to narrow it down to 1-2
files, but that doesn't seem to work.
There were similar issues during the x86/entry work.  Are the ARM guys
doing arm64/entry work now?
I'm currently looking at it. I had been trying to shift things to C for
a while, and right now I'm trying to fix the lockdep state tracking,
which is requiring untangling lockdep/rcu/tracing.

The main issue I see remaining atm is that we don't save/restore the
lockdep state over exceptions taken from kernel to kernel. That could
result in lockdep thinking IRQs are disabled when they're actually
enabled (because code in the nested context might do a save/restore
while IRQs are disabled, then return to a context where IRQs are
enabled), but AFAICT shouldn't result in the inverse in most cases since
the non-NMI handlers all call lockdep_hardirqs_disabled().
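To make the failure mode above concrete, here is a minimal userspace sketch, assuming a toy model in which one flag stands for the real IRQ mask and another for the tracker's view of it. Every name (toy_cpu, toy_irq_save, toy_exception and so on) is hypothetical and chosen for illustration; none of this is the arm64 entry code or the lockdep API.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical names, not kernel code. */
struct toy_cpu {
	bool hw_irqs_on;      /* what the CPU is really doing */
	bool tracked_irqs_on; /* what the state tracker believes */
};

/* Stand-in for a save-and-disable operation; returns the old state. */
static bool toy_irq_save(struct toy_cpu *c)
{
	bool flags = c->hw_irqs_on;

	c->hw_irqs_on = false;
	c->tracked_irqs_on = false;
	return flags;
}

/* Stand-in for restore: the tracker is only flipped back on if the
 * saved flags say IRQs were enabled. */
static void toy_irq_restore(struct toy_cpu *c, bool flags)
{
	c->hw_irqs_on = flags;
	if (flags)
		c->tracked_irqs_on = true;
}

/* Exception taken from kernel context. */
static void toy_exception(struct toy_cpu *c, bool preserve_tracker)
{
	bool hw_saved = c->hw_irqs_on;           /* hardware state is saved */
	bool tracked_saved = c->tracked_irqs_on; /* only used if preserving */
	bool flags;

	c->hw_irqs_on = false;      /* exception entry masks IRQs */
	c->tracked_irqs_on = false; /* handler tells the tracker so */

	/* Nested code does a save/restore while IRQs are disabled. */
	flags = toy_irq_save(c);
	assert(!flags);
	toy_irq_restore(c, flags);

	c->hw_irqs_on = hw_saved;   /* exception return restores hardware */
	if (preserve_tracker)
		c->tracked_irqs_on = tracked_saved;
}

/* True if the tracker agrees with the hardware after the exception. */
bool toy_state_consistent_after_exception(bool preserve_tracker)
{
	struct toy_cpu c = { .hw_irqs_on = true, .tracked_irqs_on = true };

	toy_exception(&c, preserve_tracker);
	return c.hw_irqs_on == c.tracked_irqs_on;
}
```

In this model, calling toy_state_consistent_after_exception(false) leaves the tracker believing IRQs are disabled while they are enabled, and passing true keeps the two views in agreement. The inverse mismatch does not arise here because the handler always marks IRQs as disabled on entry.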

I'm at a loss to explain the rcu vs ftrace bits, so if you have any
pointers to the issues seen with the x86 rework that'd be quite handy.
There were several over a number of months.  I especially recall issues
with the direct-from-idle execution of smp_call_function*() handlers,
and also with some of the special cases in the entry code, for example,
reentering the kernel from the kernel.  This latter could cause RCU to
not be watching when it should have been or vice versa.
Ah; those are precisely the cases I'm currently fixing, so if we're
lucky this is an indirect result of one of those rather than a novel
source of pain...
I would of course be most aware of the issues that impinged on RCU
and that were located by rcutorture.  This is actually not hard to run,
especially if the ARM bits in the scripting have managed to avoid bitrot.
The "modprobe rcutorture" approach has fewer dependencies.  Either way:
https://paulmck.livejournal.com/57769.html and later posts.
That is a very good idea. I'd been relying on Syzkaller to tickle the
issue, but the torture infrastructure is a much better fit for this
problem. I hadn't realised how comprehensive the scripting was, thanks
for this!

I'll see about giving that a go once I have the irq-from-idle cases
sorted, as those are very obviously broken if you hack
trace_hardirqs_{on,off}() to check that RCU is watching.
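The kind of check described above can be sketched in userspace as follows. This is a toy model under stated assumptions, not the actual hack: the toy_* names are hypothetical, a plain flag stands in for whatever the kernel would report about RCU watching, and a counter replaces a kernel warning.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical names, not kernel code. */
struct toy_state {
	bool rcu_watching; /* stand-in for "RCU is watching this CPU" */
	int violations;    /* hook invocations made while RCU was not watching */
};

/* Instrumented hooks: record a violation instead of warning. */
static void toy_trace_hardirqs_off(struct toy_state *s)
{
	if (!s->rcu_watching)
		s->violations++;
}

static void toy_trace_hardirqs_on(struct toy_state *s)
{
	if (!s->rcu_watching)
		s->violations++;
}

/*
 * Model an IRQ taken from idle, where RCU is not watching on entry.
 * wake_rcu_first selects the ordering: tell RCU before running the
 * tracing hooks, or after.
 */
int toy_irq_from_idle(bool wake_rcu_first)
{
	struct toy_state s = { .rcu_watching = false, .violations = 0 };

	/* Entry path. */
	if (wake_rcu_first)
		s.rcu_watching = true;
	toy_trace_hardirqs_off(&s);
	s.rcu_watching = true;
	assert(s.rcu_watching); /* the handler body runs with RCU watching */

	/* Exit path, mirroring the entry ordering. */
	if (!wake_rcu_first)
		s.rcu_watching = false;
	toy_trace_hardirqs_on(&s);
	s.rcu_watching = false; /* back to idle */

	return s.violations;
}
```

With the hooks run outside the window in which RCU is watching, toy_irq_from_idle(false) counts two violations, one on entry and one on exit. With RCU woken first, toy_irq_from_idle(true) counts none.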

Thanks,
Mark.
