Thread: 18 messages, 5 authors, 2024-01-27

RE: [PATCH v2 0/2] Update mce_record tracepoint

From: "Luck, Tony" <tony.luck@intel.com>
Date: 2024-01-26 17:10:24
Also in: linux-edac, lkml

> > 8 bytes for PPIN, 4 more for microcode.
>
> I know, nothing leads to bloat like 0.01% here, 0.001% there...

12 extra bytes divided by (say) 64GB (a very small server these days, my laptop has that much)
   = 0.00000001746%

We will need 57000 changes like this one before we get to 0.001% :-)
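
For anyone who wants to check the arithmetic above, here is a minimal sketch. It assumes "64GB" means 64 GiB (64 * 2**30 bytes); the function and constant names are made up for illustration.

```python
import math

EXTRA_BYTES = 8 + 4          # 8 bytes for PPIN, 4 for the microcode revision
MEMORY_BYTES = 64 * 2**30    # assumption: "64GB" read as 64 GiB


def overhead_percent(extra_bytes: int, memory_bytes: int) -> float:
    """Extra bytes expressed as a percentage of total memory."""
    return extra_bytes / memory_bytes * 100


def changes_needed(target_percent: float, extra_bytes: int, memory_bytes: int) -> int:
    """How many same-sized additions it takes to reach target_percent."""
    return math.ceil(target_percent / overhead_percent(extra_bytes, memory_bytes))


pct = overhead_percent(EXTRA_BYTES, MEMORY_BYTES)
print(f"{pct:.11f}%")
print(changes_needed(0.001, EXTRA_BYTES, MEMORY_BYTES))
```

The second number comes out a little above the 57000 quoted above, which was rounded down.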
> > Number of recoverable machine checks per system .... I hope the
> > monthly rate should be countable on my fingers...
>
> That's not the point. Rather, when you look at MCE reports, you pretty
> much almost always go and collect additional information from the target
> machine because you want to figure out what exactly is going on.
>
> So what's stopping you from collecting all that static information
> instead of parroting it through the tracepoint with each error?

PPIN is static. So with good tracking to keep source platform information
attached to the error record as it gets passed around among the people trying
to triage the problem, you could say it can be retrieved later (or earlier, when
setting up a database of attributes of each machine in the fleet).

But the key there is keeping the details of the source machine attached to
the error record. My first contact with machine check debugging is always
just the raw error record (from mcelog, rasdaemon, or console log).
> > PPIN is useful when talking to the CPU vendor about patterns of
> > similar errors seen across a cluster.
>
> I guess that is perhaps the only thing of the two that makes some sense
> at least - the identifier uniquely describes which CPU the error comes
> from...
>
> > MICROCODE - gives a fast path to root cause problems that have already
> > been fixed in a microcode update.
>
> But that, nah. See above.

Knowing which microcode version was loaded on a core *at the time of the error*
is critical. You've spent enough time with Ashok and Thomas tweaking the Linux
microcode driver to know that going back to the machine the next day to ask about
microcode version has a bunch of ways to get a wrong answer.
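
To make the "at the time of the error" point concrete, here is a toy sketch. It is not kernel code: the `Machine` and `ErrorRecord` types, the `log_error` helper, and all numeric values are hypothetical. It only shows why a revision snapshotted into the record can differ from what the machine reports later.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    ppin: int        # static for the life of the part
    microcode: int   # can change, e.g. after a later microcode load


@dataclass(frozen=True)
class ErrorRecord:
    status: int
    ppin: int
    microcode: int   # revision captured when the error was logged


def log_error(machine: Machine, status: int) -> ErrorRecord:
    """Snapshot the machine's identifying details into the record itself."""
    return ErrorRecord(status=status, ppin=machine.ppin, microcode=machine.microcode)


# All values below are invented for illustration.
machine = Machine(ppin=0x1234, microcode=0x100)
record = log_error(machine, status=0xBD80000000100134)

# The next day the machine is running a different revision.
machine.microcode = 0x101

print(hex(record.microcode), hex(machine.microcode))
```

Asking the machine after the update returns the new revision, while the record still carries the one that was loaded when the error happened.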

-Tony