Thread: 17 messages, 6 authors, 2025-07-30

Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors

From: Breno Leitao <leitao@debian.org>
Date: 2025-07-30 17:23:00
Also in: linux-acpi, linux-edac, linux-pci, lkml

Hello Mauro,

On Wed, Jul 30, 2025 at 06:21:37PM +0200, Mauro Carvalho Chehab wrote:
On Wed, 30 Jul 2025 06:11:52 -0700
Breno Leitao [off-list ref] wrote:
On Wed, Jul 30, 2025 at 10:13:13AM +0800, Shuai Xue wrote:
In ghes_log_hwerr(), you're counting both CPER_SEV_CORRECTED and
CPER_SEV_RECOVERABLE errors:  
Thanks. I was reading this code a bit more, and I want to make sure my
understanding is correct, given I was confused about CORRECTED and
RECOVERABLE errors.

CPER_SEV_CORRECTED means it is corrected in the background, and the OS
was not even notified about it. That includes 1-bit ECC errors.
Those are not the errors we are interested in, since they are irrelevant
to the OS.
Hardware-corrected errors aren't irrelevant. The rasdaemon utils capture
such errors, as they may be a symptom of a hardware defect. As a matter
of fact, in rasdaemon, thresholds can be set to trigger an action, such
as disabling memory blocks that contain defective memory.
Sorry, I meant that hardware-corrected errors aren't relevant in the
context of this patch, where we are interested in errors over which the
OS has some influence and decision.
This is especially relevant for HPC and supercomputer workloads, where
it is a lot cheaper to disable a block of bad memory than to lose
an entire job, which could take several weeks of run time on
a supercomputer, just because a defective memory ended up causing a
failure in the application.
Agreed. These errors are used in several ways, including to detect
hardware aging and to schedule hardware replacement during maintenance
windows.

In this patchset, I am more focused on what information to add to the
crash dump so that it is easy to correlate crashes with hardware events,
and RECOVERABLE errors are the main ones.
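
To make the distinction concrete, here is a minimal userspace sketch of
the filtering being discussed, not the code from the patch. The severity
values follow the UEFI CPER definitions (as mirrored in
include/linux/cper.h); struct hwerr_tracker, hwerr_should_track() and
hwerr_log() are hypothetical names used only for illustration, and the
policy of counting RECOVERABLE errors only is an assumption taken from
this discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Severity values as defined by the UEFI CPER specification. */
enum cper_sev {
	CPER_SEV_RECOVERABLE   = 0,
	CPER_SEV_FATAL         = 1,
	CPER_SEV_CORRECTED     = 2,
	CPER_SEV_INFORMATIONAL = 3,
};

/* Hypothetical tracker; not the structure used by the patch. */
struct hwerr_tracker {
	unsigned int count;
};

/*
 * Hypothetical policy: count only errors where the OS had some
 * influence and decision, i.e. RECOVERABLE ones. CORRECTED errors
 * are left to tools that watch for hardware aging.
 */
bool hwerr_should_track(enum cper_sev sev)
{
	return sev == CPER_SEV_RECOVERABLE;
}

/* Bump the counter if the severity matches the policy above. */
void hwerr_log(struct hwerr_tracker *t, enum cper_sev sev)
{
	assert(t != NULL);
	if (hwerr_should_track(sev))
		t->count++;
}
```

With a zero-initialized tracker, calling hwerr_log() with
CPER_SEV_CORRECTED leaves count unchanged, while each call with
CPER_SEV_RECOVERABLE increments it by one.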