Thread: 11 messages, 3 authors, 2025-08-05

Re: [PATCH v4] vmcoreinfo: Track and log recoverable hardware errors

From: Dave Hansen <hidden>
Date: 2025-08-01 16:24:44
Also in: linux-acpi, linux-edac, linux-pci, lkml

On 8/1/25 08:13, Breno Leitao wrote:
Hello Dave,

On Fri, Aug 01, 2025 at 07:52:17AM -0700, Dave Hansen wrote:
On 8/1/25 05:31, Breno Leitao wrote:
Introduce a generic infrastructure for tracking recoverable hardware
errors (HW errors that are visible to the OS but do not cause a panic)
and record them for vmcore consumption.
...

Are there patches for the consumer side of this, too? Or do humans
looking at crash dumps have to know what to go digging for?

In either case, don't we need documentation for this new ABI?
I have considered this, but the documentation for vmcoreinfo
(admin-guide/kdump/vmcoreinfo.rst) solely documents what is explicitly
exposed by vmcore, which differs from the nature of these counters.

Where would be a good place to document it?
I'm not picky. But you also didn't quite answer the question I was asking.

Is this new data for humans or machines to read?
@@ -1690,6 +1691,9 @@ noinstr void do_machine_check(struct pt_regs *regs)
 	}
 
 out:
+	/* Given it didn't panic, mark it as recoverable */
+	hwerr_log_error_type(HWERR_RECOV_MCE);
+
Does "MCE" mean anything outside of x86?
AFAIK this is an MCE concept.
I'm not really sure what that response means.

There are two problems here. First, HWERR_RECOV_MCE is defined in
arch-generic code, but it may never get used by anything other than x86
with CONFIG_X86_MCE enabled.

Second, it wastes space in your data structure when CONFIG_X86_MCE=n.
Not a huge deal as-is, but it's still a bit sloppy and wasteful.
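
One possible way to address the wasted-slot concern is sketched below as a
user-space model, not as the patch's actual code. It guards the enum entry
with the config symbol so the array only has slots for sources that can
fire. Everything except HWERR_RECOV_MCE, hwerr_data, its count and timestamp
fields, and hwerr_log_error_type() is an assumption made for illustration:
the struct name, the field types, HWERR_RECOV_EXAMPLE, HWERR_RECOV_MAX, the
#define standing in for Kconfig, and time() standing in for
ktime_get_real_seconds().

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Stand-in for the Kconfig symbol (assumption: in the kernel this comes
 * from the build configuration, not from a #define in the source). */
#define CONFIG_X86_MCE 1

enum hwerr_error_type {
	HWERR_RECOV_EXAMPLE,	/* hypothetical source that is always built */
#ifdef CONFIG_X86_MCE
	HWERR_RECOV_MCE,	/* only exists when MCE support is built in */
#endif
	HWERR_RECOV_MAX,	/* hypothetical sentinel: number of sources */
};

/* Field names follow the quoted diff; the types are assumptions. */
struct hwerr_info {
	uint32_t count;
	time_t timestamp;
};

/* Sized by the sentinel, so a disabled source costs no array slot. */
static struct hwerr_info hwerr_data[HWERR_RECOV_MAX];

static void hwerr_log_error_type(enum hwerr_error_type src)
{
	assert(src < HWERR_RECOV_MAX);
	hwerr_data[src].count++;
	/* User-space substitute for ktime_get_real_seconds(). */
	hwerr_data[src].timestamp = time(NULL);
}
```

With this layout, a call site such as do_machine_check() would need the same
guard (or a no-op stub) around hwerr_log_error_type(HWERR_RECOV_MCE). That is
the usual cost of making an enum entry conditional.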

...
+	hwerr_data[src].count++;
+	hwerr_data[src].timestamp = ktime_get_real_seconds();
+}
+EXPORT_SYMBOL_GPL(hwerr_log_error_type);
I'd also love to hear more about _actual_ users of this. Surely, someone
hit a real world problem and thought this would be a nifty solution. Who
was that? What problem did they hit? How does this help them?
Yes, this has been extensively discussed in the very first version of
the patch. Borislav raised the same question, which was discussed in the
following link:

https://lore.kernel.org/all/20250715125327.GGaHZPRz9QLNNO-7q8@fat_crate.local/
When someone raises a concern, we usually try to alleviate the concern
in a way that is self-contained in the next posting. A cover letter with
a full explanation would be one place to put the reasoning, for example.

But expecting future reviewers to plod through all the old threads isn't
really feasible.