Thread: 5 messages, 3 authors, 2025-05-08

Re: [PATCH v7] PCI: hotplug: Add a generic RAS tracepoint for hotplug event

From: Shuai Xue <xueshuai@linux.alibaba.com>
Date: 2025-05-07 12:03:35
Also in: linux-edac, linux-pci, lkml


On 2025/5/7 19:09, Jonathan Cameron wrote:
On Wed,  7 May 2025 09:15:35 +0800
Shuai Xue [off-list ref] wrote:
Hotplug events are critical indicators for analyzing hardware health,
particularly in AI supercomputers where surprise link downs can
significantly impact system performance and reliability.

To this end, define a new TRACING_SYSTEM named pci, add a generic RAS
tracepoint for hotplug events to help with health checks, and generate
tracepoints for PCIe hotplug events. Add enum pci_hotplug_event in
include/uapi/linux/pci.h so applications like rasdaemon can register
tracepoint event handlers for it.

The output looks like this:

$ echo 1 > /sys/kernel/debug/tracing/events/pci/pci_hp_event/enable
$ cat /sys/kernel/debug/tracing/trace_pipe
     <...>-206     [001] .....    40.373870: pci_hp_event: 0000:00:02.0 slot:10, event:Link Down

     <...>-206     [001] .....    40.374871: pci_hp_event: 0000:00:02.0 slot:10, event:Card not present

Suggested-by: Lukas Wunner <lukas@wunner.de>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
FWIW looks good to me.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Thanks :)
Any userspace tooling planned for this?
Yep, we plan to monitor this tracepoint in rasdaemon after this patch is merged.

Best Regards,
Shuai
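
For illustration only, here is a minimal sketch of how a userspace consumer
might parse the pci_hp_event lines shown in the sample output quoted above.
It is not rasdaemon's implementation. The line format is assumed from that
sample alone, and the names parse_hp_event and HotplugEvent are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Pattern derived from the sample trace output quoted in this thread.
# The exact field layout is an assumption and may differ in the merged patch.
_LINE_RE = re.compile(
    r"(?P<timestamp>\d+\.\d+):\s+pci_hp_event:\s+"
    r"(?P<port>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\s+"
    r"slot:(?P<slot>[^,]+),\s+event:(?P<event>.+?)\s*$"
)


class HotplugEvent(NamedTuple):
    """Hypothetical container for one parsed pci_hp_event line."""
    timestamp: float  # seconds, as printed by the tracer
    port: str         # PCI address of the hotplug port, e.g. 0000:00:02.0
    slot: str         # slot name, kept as a string
    event: str        # event description, e.g. "Link Down"


def parse_hp_event(line: str) -> Optional[HotplugEvent]:
    """Return a HotplugEvent if the line is a pci_hp_event record, else None."""
    m = _LINE_RE.search(line)
    if m is None:
        return None
    return HotplugEvent(float(m["timestamp"]), m["port"], m["slot"], m["event"])


if __name__ == "__main__":
    sample = ("     <...>-206     [001] .....    40.373870: "
              "pci_hp_event: 0000:00:02.0 slot:10, event:Link Down")
    print(parse_hp_event(sample))
```

A real monitor would feed this function lines read from
/sys/kernel/debug/tracing/trace_pipe (typically requiring root), or use a
tracing library rather than text parsing.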