Re: [PATCH v2 0/2] Update mce_record tracepoint
From: Borislav Petkov <bp@alien8.de>
Date: 2024-01-26 21:11:52
Also in:
linux-edac, lkml
On Fri, Jan 26, 2024 at 08:49:03PM +0000, Luck, Tony wrote:
Every patch that adds new code or data structures adds to the kernel memory footprint. Each should be considered on its merits. The basic question being: "Is the new functionality worth the cost?" Where does it end? It would end if Linus declared: "Linux is now complete. Stop sending patches". I.e. it is never going to end.
No, it's not that - it is the merit thing which determines.
1) PPIN
   Cost = 8 bytes.
   Benefit: Embeds a system identifier into the trace record so there can be no ambiguity about which machine generated this error. Also definitively indicates which socket on a multi-socket system.
2) MICROCODE
   Cost = 4 bytes.
   Benefit: Certainty about the microcode version active on the core at the time the error was detected.

RAS = Reliability, Availability, Serviceability. These changes fall into the serviceability bucket. They make it easier to diagnose what went wrong.
So does dmesg. Let's add it to the tracepoint...
But no, that's not the right question to ask.
It is rather: which bits of information are very relevant to an error
record and which are transient enough so that they cannot be gathered
from a system by other means or only gathered in a difficult way, and
should be part of that record.
The PPIN is not transient but you have to go map ->extcpu to the PPIN so
adding it to the tracepoint is purely a convenience thing. More or less.
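
To illustrate the mapping step mentioned above, here is a minimal userspace sketch of what resolving ->extcpu to a PPIN after the fact could look like. It is an assumption-laden illustration, not kernel code: struct cpu_ppin_entry, ppin_for_extcpu() and the idea of a per-machine table collected out of band are all hypothetical names and choices made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical inventory entry, gathered once per machine by some
 * out-of-band means. Nothing here is an existing kernel interface.
 */
struct cpu_ppin_entry {
	uint32_t cpu;   /* logical CPU number, as carried in ->extcpu */
	uint64_t ppin;  /* PPIN of the socket that CPU belongs to */
};

/*
 * Resolve a logical CPU number to its socket's PPIN.
 * Returns 0 and stores the PPIN in *ppin on success,
 * -1 if the CPU is not present in the table.
 */
static int ppin_for_extcpu(const struct cpu_ppin_entry *table, size_t n,
			   uint32_t extcpu, uint64_t *ppin)
{
	assert(table != NULL && ppin != NULL);

	for (size_t i = 0; i < n; i++) {
		if (table[i].cpu == extcpu) {
			*ppin = table[i].ppin;
			return 0;
		}
	}
	return -1;
}
```

The lookup only works if the table was captured from the same machine and still matches its CPU numbering, which is the extra step that carrying the PPIN in the record would avoid.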
The microcode revision thing I still don't buy but it is already there
so whateva...
So we'd need a rule hammered out and put there in a prominent place so
that it is clear what goes into struct mce and what not.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette