Re: [Qemu-ppc] pseries on qemu-system-ppc64le crashes in doorbell_core_ipi()

From: Nicholas Piggin <npiggin@gmail.com>
Date: 2019-03-30 03:12:26
Subsystem: the rest · Maintainer: Linus Torvalds

Steven Rostedt's on March 30, 2019 1:31 am:

[ Added Peter ]

On Fri, 29 Mar 2019 19:13:55 +1000
Nicholas Piggin [off-list ref] wrote:

quoted

Suraj Jitindar Singh's on March 29, 2019 3:20 pm:

quoted

On Wed, 2019-03-27 at 17:51 +0100, Cédric Le Goater wrote:

quoted

On 3/27/19 5:37 PM, Cédric Le Goater wrote:

quoted

On 3/27/19 1:36 PM, Sebastian Andrzej Siewior wrote:

quoted

With qemu-system-ppc64le -machine pseries -smp 4 I get:

quoted

#  chrt 1 hackbench
Running in process mode with 10 groups using 40 file
descriptors each (== 400 tasks)
Each sender will pass 100 messages of 100 bytes
Oops: Exception in kernel mode, sig: 4 [#1]
LE PAGE_SIZE=64K MMU=Hash PREEMPT SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 0 PID: 629 Comm: hackbench Not tainted 5.1.0-rc2 #71
NIP:  c000000000046978 LR: c000000000046a38 CTR:
c0000000000b0150
REGS: c0000001fffeb8e0 TRAP: 0700   Not tainted  (5.1.0-rc2)
MSR:  8000000000089033 <SF,EE,ME,IR,DR,RI,LE>  CR:
42000874  XER: 00000000
CFAR: c000000000046a34 IRQMASK: 1
GPR00: c0000000000b0170 c0000001fffebb70 c000000000a6ba00
0000000028000000

…

quoted

NIP [c000000000046978] doorbell_core_ipi+0x28/0x30
LR [c000000000046a38] doorbell_try_core_ipi+0xb8/0xf0
Call Trace:
[c0000001fffebb70] [c0000001fffebba0] 0xc0000001fffebba0
(unreliable)
[c0000001fffebba0] [c0000000000b0170]
smp_pseries_cause_ipi+0x20/0x70
[c0000001fffebbd0] [c00000000004b02c]
arch_send_call_function_single_ipi+0x8c/0xa0
[c0000001fffebbf0] [c0000000001de600]
irq_work_queue_on+0xe0/0x130
[c0000001fffebc30] [c0000000001340c8]
rto_push_irq_work_func+0xc8/0x120

…

quoted

Instruction dump:
60000000 60000000 3c4c00a2 384250b0 3d220009 392949c8 81290000
3929ffff
7d231838 7c0004ac 5463017e 64632800 <7c00191c> 4e800020
3c4c00a2 38425080
---[ end trace eb842b544538cbdf ]---

This is unusual and causing powerpc code to crash because the rt
scheduler is telling irq_work_queue_on to queue work on this CPU.
Is that something allowed? There's no warnings in there but it must
be a rarely tested path, would it be better to ban it?

Steven is this queue_work_on to self by design?

I just figured that a irq_work_queue_on() would default to just
irq_work_queue() if cpu == smp_processor_id().

Yeah it actually doesn't, it raises a CALL_FUNCTION IPI on the
current CPU. Maybe it should do that.

I'm sure this case has been tested (on x86), as I stressed this quite a
bit, and all it takes is to have an RT task queued on the CPU
processing the current "push" logic when it's not holding the rq locks
and read that the current CPU now has an overloaded RT task.

If calling irq_work_queue_on() is really bad to do to the same CPU,
then we can work on changing the logic to do something different if the
CPU returned from rto_next_cpu() is the current CPU.

I don't actually know that it is bad. It does seem to violate the
principle of least surprise for arch code though.

From this bug report, I'm guessing it's pretty rare case to hit so
I would worry about subtle bugs. And I just cc'ed you in case your
code was actually not intending to do this.

Or would it be possible to just change irq_work_queue_on() to call
irq_work_queue() if cpu == smp_processor_id()?

Something like this?

kernel/irq_work: Do not raise an IPI when queueing work on the local CPU

The QEMU powerpc/pseries machine model was not expecting a self-IPI,
and it may be a bit surprising thing to do, so have irq_work_queue_on
do local queueing when target is current CPU.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 kernel/irq_work.c | 78 ++++++++++++++++++++++++++---------------------
 1 file changed, 43 insertions(+), 35 deletions(-)

diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 6b7cdf17ccf8..f0e539d0f879 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c

@@ -56,61 +56,69 @@ void __weak arch_irq_work_raise(void)
 	 */
 }
 
-/*
- * Enqueue the irq_work @work on @cpu unless it's already pending
- * somewhere.
- *
- * Can be re-enqueued while the callback is still in progress.
- */
-bool irq_work_queue_on(struct irq_work *work, int cpu)
+/* Enqueue on current CPU, work must already be claimed and preempt disabled */
+static void __irq_work_queue(struct irq_work *work)
 {
-	/* All work should have been flushed before going offline */
-	WARN_ON_ONCE(cpu_is_offline(cpu));
-
-#ifdef CONFIG_SMP
-
-	/* Arch remote IPI send/receive backend aren't NMI safe */
-	WARN_ON_ONCE(in_nmi());
+	/* If the work is "lazy", handle it from next tick if any */
+	if (work->flags & IRQ_WORK_LAZY) {
+		if (llist_add(&work->llnode, this_cpu_ptr(&lazy_list)) &&
+		    tick_nohz_tick_stopped())
+			arch_irq_work_raise();
+	} else {
+		if (llist_add(&work->llnode, this_cpu_ptr(&raised_list)))
+			arch_irq_work_raise();
+	}
+}
 
+/* Enqueue the irq work @work on the current CPU */
+bool irq_work_queue(struct irq_work *work)
+{
 	/* Only queue if not already pending */
 	if (!irq_work_claim(work))
 		return false;
 
-	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
-		arch_send_call_function_single_ipi(cpu);
-
-#else /* #ifdef CONFIG_SMP */
-	irq_work_queue(work);
-#endif /* #else #ifdef CONFIG_SMP */
+	/* Queue the entry and raise the IPI if needed. */
+	preempt_disable();
+	__irq_work_queue(work);
+	preempt_enable();
 
 	return true;
 }
+EXPORT_SYMBOL_GPL(irq_work_queue);
 
-/* Enqueue the irq work @work on the current CPU */
-bool irq_work_queue(struct irq_work *work)
+/*
+ * Enqueue the irq_work @work on @cpu unless it's already pending
+ * somewhere.
+ *
+ * Can be re-enqueued while the callback is still in progress.
+ */
+bool irq_work_queue_on(struct irq_work *work, int cpu)
 {
+#ifndef CONFIG_SMP
+	return irq_work_queue(work);
+
+#else /* #ifndef CONFIG_SMP */
+	/* All work should have been flushed before going offline */
+	WARN_ON_ONCE(cpu_is_offline(cpu));
+
 	/* Only queue if not already pending */
 	if (!irq_work_claim(work))
 		return false;
 
-	/* Queue the entry and raise the IPI if needed. */
 	preempt_disable();
-
-	/* If the work is "lazy", handle it from next tick if any */
-	if (work->flags & IRQ_WORK_LAZY) {
-		if (llist_add(&work->llnode, this_cpu_ptr(&lazy_list)) &&
-		    tick_nohz_tick_stopped())
-			arch_irq_work_raise();
-	} else {
-		if (llist_add(&work->llnode, this_cpu_ptr(&raised_list)))
-			arch_irq_work_raise();
-	}
-
+	if (cpu != smp_processor_id()) {
+		/* Arch remote IPI send/receive backend aren't NMI safe */
+		WARN_ON_ONCE(in_nmi());
+		if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+			arch_send_call_function_single_ipi(cpu);
+	} else
+		__irq_work_queue(work);
 	preempt_enable();
 
 	return true;
+#endif /* #else #ifndef CONFIG_SMP */
 }
-EXPORT_SYMBOL_GPL(irq_work_queue);
+
 
 bool irq_work_needs_cpu(void)
 {

-- 
2.20.1

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help