Thread: 9 messages, 2 authors, 2019-03-31

Re: [RFC PATCH 3/3] powenv/mce: print additional information about mce error.

From: Michael Ellerman <mpe@ellerman.id.au>
Date: 2019-03-29 01:33:34

Mahesh J Salgaonkar [off-list ref] writes:
From: Mahesh Salgaonkar <redacted>

Print more information about an mce error, indicating whether it is a
hardware or software error.

Some mce errors can be easily categorized as hardware or software
errors, e.g. UEs are due to hardware errors, whereas an error triggered by
invalid usage of tlbie is a pure software bug. But not all mce errors
can be easily categorized as either software or hardware. There are errors
like multihit errors, which are usually the result of a software bug, but in
some rare cases a hardware failure can cause a multihit error. In the past, we
have seen a case where, after replacing a faulty chip, multihit errors stopped
occurring. The same goes for parity errors, which are usually due to faulty
hardware, but there are cases where a multihit can also cause a parity error.
For such errors it is difficult to determine the real cause. Hence this patch
classifies mce errors into the following four categories:
	1. Hardware error:
		UE and Link timeout failure errors.
	2. Hardware error, small probability of software cause:
		SLB/ERAT/TLB Parity errors.
	3. Software error:
		Invalid tlbie form.
	4. Software error, small probability of hardware failure:
		SLB/ERAT/TLB Multihit errors.

I like the idea, but I think the phrasing is a little confusing.

Sample o/p:

[ 1259.331319] MCE: CPU40: (Warning) Guest SLB Multihit at 00007fff9a59dc60 DAR: 000001003d740320 [Recovered]
[ 1259.331324] MCE: CPU40: PID: 24051 Comm: qemu-system-ppc
[ 1259.331345] MCE: CPU40: Software error, small probability of hardware failure

"Software error, small probability of hardware failure"

That can be read as "there has been a software error, *and now* there is
a small probability of a hardware failure".

I also think "probability" suggests we actually know the mathematical
probability of it being a hardware failure, which is not true.

Instead maybe we use:

	"Hardware error",
	"Probable hardware error (some chance of software cause)",
	"Software error",
	"Probable software error (some chance of hardware cause)",

??
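
To make the suggestion concrete, here is a rough sketch of how the four strings
could be kept in one table indexed by error class. The enum, the table and the
helper name are all hypothetical, made up for illustration; they are not taken
from the patch or from the existing kernel code.

```c
#include <stddef.h>

/* Hypothetical error classes; the names are illustrative only. */
enum mce_class {
	MCE_CLASS_HARDWARE,
	MCE_CLASS_PROBABLE_HARDWARE,
	MCE_CLASS_SOFTWARE,
	MCE_CLASS_PROBABLE_SOFTWARE,
	MCE_CLASS_COUNT
};

/* The four phrasings proposed above, one per class. */
static const char *const mce_class_str[MCE_CLASS_COUNT] = {
	[MCE_CLASS_HARDWARE] =
		"Hardware error",
	[MCE_CLASS_PROBABLE_HARDWARE] =
		"Probable hardware error (some chance of software cause)",
	[MCE_CLASS_SOFTWARE] =
		"Software error",
	[MCE_CLASS_PROBABLE_SOFTWARE] =
		"Probable software error (some chance of hardware cause)",
};

/* Return the message for a class, or a fallback for out-of-range values. */
static const char *mce_class_name(enum mce_class c)
{
	if ((unsigned int)c >= MCE_CLASS_COUNT)
		return "Unknown";
	return mce_class_str[c];
}
```

With something like this, the third line of the sample output would be printed
from mce_class_name() for the class assigned to the error, so the wording lives
in one place and is easy to adjust.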

cheers