Thread: 18 messages, 5 authors, 2024-01-27

Re: [PATCH v2 0/2] Update mce_record tracepoint

From: Borislav Petkov <bp@alien8.de>
Date: 2024-01-27 12:19:39
Also in: linux-edac, lkml

On Fri, Jan 26, 2024 at 10:01:29PM +0000, Luck, Tony wrote:
> PPIN: Nice to have. People that send stuff to me are terrible about
> providing surrounding details. The record already includes
> CPUID(1).EAX ... so I can at least skip the step of asking them which
> CPU family/model/stepping they were using. But PPIN can be recovered
> (so long as the submitter kept good records about which system
> generated the record).

Yes.

> MICROCODE: Must have. Microcode version can be changed at run time.
> Going back to the system to check later may not give the correct
> answer to what was active at the time of the error. Especially for an
> error reported while a microcode update is walking across the CPUs
> poking the MSR on each in turn.

Easy:

- You've got an MCE? Was it during scheduled microcode updates?
- Yes.
- Come back to me when it happens again, *outside* of the microcode
  update schedule.

Anyway, I still don't buy that. Maybe I'm wrong and maybe I need to talk
to data center operators more, but this makes it sound as if failing
microcode updates are so common that we *absolutely* *must* capture the
microcode revision when an MCE happens.

Maybe we should make microcode updates more resilient and add a retry
mechanism which doesn't back off as easily.

Or maybe people should script around it and keep retrying, dunno.

In my world, microcode update just works in the vast majority of
cases and if it doesn't, then those cases need a specific look.

And if I am debugging an issue and I want to see the microcode revision,
I look at /proc/cpuinfo.
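
For illustration, a minimal sketch of that lookup, assuming the x86
/proc/cpuinfo layout with one "processor" and one "microcode" line per
logical CPU. The helper name microcode_revisions() is made up for this
example. Note that it reports whatever revision is loaded at the time
you run it, not what was active when an earlier error was logged.

```python
from pathlib import Path


def microcode_revisions(cpuinfo_text):
    """Map logical CPU number -> microcode revision string.

    Hypothetical helper: expects text in the x86 /proc/cpuinfo format,
    i.e. "key<TAB>: value" lines grouped per logical CPU.
    """
    revisions = {}
    cpu = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between CPU blocks
        key = key.strip()
        value = value.strip()
        if key == "processor":
            cpu = int(value)
        elif key == "microcode" and cpu is not None:
            revisions[cpu] = value
    return revisions


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    if path.exists():
        revs = microcode_revisions(path.read_text())
        for cpu, rev in sorted(revs.items()):
            print(f"cpu{cpu}: {rev}")
        # Differing revisions would be expected only while an update is
        # still in progress across the CPUs, or after a partial failure.
        if len(set(revs.values())) > 1:
            print("note: CPUs report different microcode revisions")
```

On architectures whose /proc/cpuinfo has no "microcode" field, the
function simply returns an empty mapping.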

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette