Re: [Qemu-ppc] pseries on qemu-system-ppc64le crashes in doorbell_core_ipi()
From: Nicholas Piggin <npiggin@gmail.com>
Date: 2019-03-30 03:12:26
Subsystem:
the rest · Maintainer:
Linus Torvalds
Steven Rostedt's on March 30, 2019 1:31 am:
[ Added Peter ] On Fri, 29 Mar 2019 19:13:55 +1000 Nicholas Piggin [off-list ref] wrote:quoted
Suraj Jitindar Singh's on March 29, 2019 3:20 pm:quoted
On Wed, 2019-03-27 at 17:51 +0100, Cédric Le Goater wrote:quoted
On 3/27/19 5:37 PM, Cédric Le Goater wrote:quoted
On 3/27/19 1:36 PM, Sebastian Andrzej Siewior wrote:quoted
With qemu-system-ppc64le -machine pseries -smp 4 I get:quoted
# chrt 1 hackbench Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks) Each sender will pass 100 messages of 100 bytes Oops: Exception in kernel mode, sig: 4 [#1] LE PAGE_SIZE=64K MMU=Hash PREEMPT SMP NR_CPUS=2048 NUMA pSeries Modules linked in: CPU: 0 PID: 629 Comm: hackbench Not tainted 5.1.0-rc2 #71 NIP: c000000000046978 LR: c000000000046a38 CTR: c0000000000b0150 REGS: c0000001fffeb8e0 TRAP: 0700 Not tainted (5.1.0-rc2) MSR: 8000000000089033 <SF,EE,ME,IR,DR,RI,LE> CR: 42000874 XER: 00000000 CFAR: c000000000046a34 IRQMASK: 1 GPR00: c0000000000b0170 c0000001fffebb70 c000000000a6ba00 0000000028000000…quoted
NIP [c000000000046978] doorbell_core_ipi+0x28/0x30 LR [c000000000046a38] doorbell_try_core_ipi+0xb8/0xf0 Call Trace: [c0000001fffebb70] [c0000001fffebba0] 0xc0000001fffebba0 (unreliable) [c0000001fffebba0] [c0000000000b0170] smp_pseries_cause_ipi+0x20/0x70 [c0000001fffebbd0] [c00000000004b02c] arch_send_call_function_single_ipi+0x8c/0xa0 [c0000001fffebbf0] [c0000000001de600] irq_work_queue_on+0xe0/0x130 [c0000001fffebc30] [c0000000001340c8] rto_push_irq_work_func+0xc8/0x120…quoted
Instruction dump: 60000000 60000000 3c4c00a2 384250b0 3d220009 392949c8 81290000 3929ffff 7d231838 7c0004ac 5463017e 64632800 <7c00191c> 4e800020 3c4c00a2 38425080 ---[ end trace eb842b544538cbdf ]---This is unusual and causing powerpc code to crash because the rt scheduler is telling irq_work_queue_on to queue work on this CPU. Is that something allowed? There's no warnings in there but it must be a rarely tested path, would it be better to ban it? Steven is this queue_work_on to self by design?I just figured that a irq_work_queue_on() would default to just irq_work_queue() if cpu == smp_processor_id().
Yeah it actually doesn't, it raises a CALL_FUNCTION IPI on the current CPU. Maybe it should do that.
I'm sure this case has been tested (on x86), as I stressed this quite a bit, and all it takes is to have an RT task queued on the CPU processing the current "push" logic when it's not holding the rq locks and read that the current CPU now has an overloaded RT task. If calling irq_work_queue_on() is really bad to do to the same CPU, then we can work on changing the logic to do something different if the CPU returned from rto_next_cpu() is the current CPU.
I don't actually know that it is bad. It does seem to violate the principle of least surprise for arch code though. From this bug report, I'm guessing it's pretty rare case to hit so I would worry about subtle bugs. And I just cc'ed you in case your code was actually not intending to do this.
Or would it be possible to just change irq_work_queue_on() to call irq_work_queue() if cpu == smp_processor_id()?
Something like this? kernel/irq_work: Do not raise an IPI when queueing work on the local CPU The QEMU powerpc/pseries machine model was not expecting a self-IPI, and it may be a bit surprising thing to do, so have irq_work_queue_on do local queueing when target is current CPU. Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> --- kernel/irq_work.c | 78 ++++++++++++++++++++++++++--------------------- 1 file changed, 43 insertions(+), 35 deletions(-)
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 6b7cdf17ccf8..f0e539d0f879 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c@@ -56,61 +56,69 @@ void __weak arch_irq_work_raise(void) */ } -/* - * Enqueue the irq_work @work on @cpu unless it's already pending - * somewhere. - * - * Can be re-enqueued while the callback is still in progress. - */ -bool irq_work_queue_on(struct irq_work *work, int cpu) +/* Enqueue on current CPU, work must already be claimed and preempt disabled */ +static void __irq_work_queue(struct irq_work *work) { - /* All work should have been flushed before going offline */ - WARN_ON_ONCE(cpu_is_offline(cpu)); - -#ifdef CONFIG_SMP - - /* Arch remote IPI send/receive backend aren't NMI safe */ - WARN_ON_ONCE(in_nmi()); + /* If the work is "lazy", handle it from next tick if any */ + if (work->flags & IRQ_WORK_LAZY) { + if (llist_add(&work->llnode, this_cpu_ptr(&lazy_list)) && + tick_nohz_tick_stopped()) + arch_irq_work_raise(); + } else { + if (llist_add(&work->llnode, this_cpu_ptr(&raised_list))) + arch_irq_work_raise(); + } +} +/* Enqueue the irq work @work on the current CPU */ +bool irq_work_queue(struct irq_work *work) +{ /* Only queue if not already pending */ if (!irq_work_claim(work)) return false; - if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) - arch_send_call_function_single_ipi(cpu); - -#else /* #ifdef CONFIG_SMP */ - irq_work_queue(work); -#endif /* #else #ifdef CONFIG_SMP */ + /* Queue the entry and raise the IPI if needed. */ + preempt_disable(); + __irq_work_queue(work); + preempt_enable(); return true; } +EXPORT_SYMBOL_GPL(irq_work_queue); -/* Enqueue the irq work @work on the current CPU */ -bool irq_work_queue(struct irq_work *work) +/* + * Enqueue the irq_work @work on @cpu unless it's already pending + * somewhere. + * + * Can be re-enqueued while the callback is still in progress. + */ +bool irq_work_queue_on(struct irq_work *work, int cpu) { +#ifndef CONFIG_SMP + return irq_work_queue(work); + +#else /* #ifndef CONFIG_SMP */ + /* All work should have been flushed before going offline */ + WARN_ON_ONCE(cpu_is_offline(cpu)); + /* Only queue if not already pending */ if (!irq_work_claim(work)) return false; - /* Queue the entry and raise the IPI if needed. */ preempt_disable(); - - /* If the work is "lazy", handle it from next tick if any */ - if (work->flags & IRQ_WORK_LAZY) { - if (llist_add(&work->llnode, this_cpu_ptr(&lazy_list)) && - tick_nohz_tick_stopped()) - arch_irq_work_raise(); - } else { - if (llist_add(&work->llnode, this_cpu_ptr(&raised_list))) - arch_irq_work_raise(); - } - + if (cpu != smp_processor_id()) { + /* Arch remote IPI send/receive backend aren't NMI safe */ + WARN_ON_ONCE(in_nmi()); + if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) + arch_send_call_function_single_ipi(cpu); + } else + __irq_work_queue(work); preempt_enable(); return true; +#endif /* #else #ifndef CONFIG_SMP */ } -EXPORT_SYMBOL_GPL(irq_work_queue); + bool irq_work_needs_cpu(void) {
--
2.20.1