Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held

From: Ira Weiny <ira.weiny@intel.com>
Date: 2024-02-14 16:41:03
Also in: linux-acpi, linux-cxl, lkml

Jonathan Cameron wrote:

quoted hunk ↗ jump to hunk

On Tue, 06 Feb 2024 14:15:32 -0800
Ira Weiny [off-list ref] wrote:

quoted

Smatch caught that cxl_cper_post_event() is called with a spinlock held
or preemption disabled.[1]  The callback takes the device lock to
perform address translation and therefore might sleep.  The record data
is released back to BIOS in ghes_clear_estatus() which requires it to be
copied for use in the workqueue.

Copy the record to a lockless list and schedule a work item to process
the record outside of atomic context.

[1] https://lore.kernel.org/all/b963c490-2c13-4b79-bbe7-34c6568423c7@moroto.mountain/ (local)

Reported-by: Dan Carpenter <redacted>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

+CC tracing folk for the splat below (the second one as first one is fixed!)

Lock debugging is slow (on an emulated machine) :(
Testing with my gitlab.com/jic23/qemu cxl-2024-02-05-draft branch
which is only one I've put out with the FW first injection patches so far

For reference without this patch I got a nice spat identifying the original bug.
So far so good.

With this patch (and tp_printk in command line and trace points enabled)
[  193.048229] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[  193.049636] {1}[Hardware Error]: event severity: recoverable
[  193.050472] {1}[Hardware Error]:  Error 0, type: recoverable
[  193.051337] {1}[Hardware Error]:   section type: unknown, fbcd0a77-c260-417f-85a9-088b1621eba6
[  193.052270] {1}[Hardware Error]:   section length: 0x90
[  193.053402] {1}[Hardware Error]:   00000000: 00000090 00000007 00000000 0d938086  ................
[  193.055036] {1}[Hardware Error]:   00000010: 000e0000 00000000 00000005 00000000  ................
[  193.058592] {1}[Hardware Error]:   00000020: 00000180 00000000 1626fa24 17b3b158  ........$.&.X...
[  193.062289] {1}[Hardware Error]:   00000030: 00000000 00000000 00000000 00000000  ................
[  193.065959] {1}[Hardware Error]:   00000040: 000007d0 00000000 0fc00307 05210300  ..............!.
[  193.069782] {1}[Hardware Error]:   00000050: 72690000 6d207361 00326d65 00000000  ..iras mem2.....
[  193.072760] {1}[Hardware Error]:   00000060: 00000000 00000000 00000000 00000000  ................
[  193.074062] {1}[Hardware Error]:   00000070: 00000000 00000000 00000000 00000000  ................
[  193.075346] {1}[Hardware Error]:   00000080: 00000000 00000000 00000000 00000000  ................

I rebased after this so now we get the smaller print - but that's not really relevant here.

[  193.086589] cxl_general_media: memdev=mem1 host=0000:0e:00.0 serial=5 log=Warning : time=1707903675590441508 uuid=fbcd0a77-c260-417f-85a9-088b1621eba6 len=128 flags='0x1' handle=0 related_handle=0 maint_op_class=0 : dpa=7c0 dpa_flags='0x10' descriptor='UNCORRECTABLE_EVENT|THRESHOLD_EVENT|POISON_LIST_OVERFLOW' type='0x3' transaction_type='0xc0' channel=3 rank=33 device=5 comp_id=69 72 61 73 20 6d 65 6d 32 00 00 00 00 00 00 00 validity_flags='CHANNEL|RANK|DEVICE|COMPONENT'
[  193.087181]                                                                                                                                                                                                                                                      
[  193.087361] =============================
[  193.087399] [ BUG: Invalid wait context ]
[  193.087504] 6.8.0-rc3 #1200 Not tainted
[  193.087660] -----------------------------
[  193.087858] kworker/3:0/31 is trying to lock:
[  193.087966] ffff0000c0ce1898 (&port_lock_key){-.-.}-{3:3}, at: pl011_console_write+0x148/0x1c8
[  193.089754] other info that might help us debug this:
[  193.089820] context-{5:5}[  193.089900] 8 locks held by kworker/3:0/31:
[  193.089990]  #0: ffff0000c0018738 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x154/0x500
[  193.090439]  #1: ffff800083793de0 (cxl_cper_work){+.+.}-{0:0}, at: process_one_work+0x154/0x500
[  193.090718]  #2: ffff800082883310 (cxl_cper_rw_sem){++++}-{4:4}, at: cxl_cper_work_fn+0x2c/0xb0
[  193.091019]  #3: ffff0000c0a7b1a8 (&dev->mutex){....}-{4:4}, at: pci_dev_lock+0x28/0x48
[  193.091431]  #4: ffff800082738f18 (tracepoint_iter_lock){....}-{2:2}, at: trace_event_buffer_commit+0xd8/0x2c8
[  193.091772]  #5: ffff8000826b3ce8 (console_lock){+.+.}-{0:0}, at: vprintk_emit+0x124/0x398
[  193.092031]  #6: ffff8000826b3d40 (console_srcu){....}-{0:0}, at: console_flush_all+0x88/0x4b0
[  193.092363]  #7: ffff8000826b3ef8 (console_owner){....}-{0:0}, at: console_flush_all+0x1bc/0x4b0
[  193.092799] stack backtrace:
[  193.092973] CPU: 3 PID: 31 Comm: kworker/3:0 Not tainted 6.8.0-rc3 #1200
[  193.093118] Hardware name: QEMU QEMU Virtual Machine, BIOS unknown unknown
[  193.093468] Workqueue: events cxl_cper_work_fn
[  193.093782] Call trace:
[  193.094004]  dump_backtrace+0xa4/0x130
[  193.094145]  show_stack+0x20/0x38
[  193.094231]  dump_stack_lvl+0x60/0xb0
[  193.094315]  dump_stack+0x18/0x28
[  193.094395]  __lock_acquire+0x9e8/0x1ee8
[  193.094477]  lock_acquire+0x118/0x2e8
[  193.094557]  _raw_spin_lock+0x50/0x70
[  193.094820]  pl011_console_write+0x148/0x1c8
[  193.094904]  console_flush_all+0x218/0x4b0
[  193.094985]  console_unlock+0x74/0x140
[  193.095066]  vprintk_emit+0x230/0x398
[  193.095146]  vprintk_default+0x40/0x58
[  193.095226]  vprintk+0x98/0xb0
[  193.095306]  _printk+0x64/0x98
[  193.095385]  trace_event_buffer_commit+0x138/0x2c8
[  193.095467]  trace_event_raw_event_cxl_general_media+0x1a8/0x280 [cxl_core]
[  193.096191]  __traceiter_cxl_general_media+0x50/0x78 [cxl_core]
[  193.096359]  cxl_event_trace_record+0x204/0x2d0 [cxl_core]
[  193.096520]  cxl_cper_event_call+0xb0/0xe0 [cxl_pci]

The rw_sem is held to protect the callback pointer.

[  193.096713]  cxl_cper_work_fn+0x7c/0xb0
[  193.096808]  process_one_work+0x1f4/0x500
[  193.096891]  worker_thread+0x1f0/0x3f0
[  193.096971]  kthread+0x110/0x120
[  193.097052]  ret_from_fork+0x10/0x20

So I'm not sure how to fix this or if we even want to.

We could try and release locks before calling the tracepoint but that looks
very fiddly and would require ghes.c to be able to see more of the
CXL tracepoint infrastructure which isn't great.

Just because I was feeling cheeky I did a quick test with following.
I have a sneaky suspicion this won't got down well even though it 'fixes'
our issue...  My google fu / lore search terms are failing to find
much previous discussion of this.

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 9ff8a439d674..7ee45f22f56f 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c

@@ -2957,7 +2957,7 @@ static void output_printk(struct trace_event_buffer *fbuffer)
        iter->ent = fbuffer->entry;
        event_call->event.funcs->trace(iter, 0, event);
        trace_seq_putc(&iter->seq, 0);
-       printk("%s", iter->seq.buffer);
+       printk_deferred("%s", iter->seq.buffer);

        raw_spin_unlock_irqrestore(&tracepoint_iter_lock, flags);
 }

My assumption is similar views will apply here to Peter Zijlstra's comment on
not using printk_deferred.

https://lore.kernel.org/all/20231010141244.GM377@noisy.programming.kicks-ass.net/ (local)

Note I also tried the hacks Peter links to from there. They trip issues in the initial
CPER print - so I assume not appropriate here.

So I'm thinking this is a won't fix - wait for the printk rework to land and
assume this will be resolved as well?

Or could we avoid the situation entirely by using a static call?

The work queue still needs to be created because of the atomicness of
ghes_do_proc() but it avoids cxl_cper_rw_sem.

I think the hardest thing may be writing the commit message to explain all
this.  :-(

Ira

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help