Thread (5 messages), 4 authors, 2017-10-25

Re: [PATCH] powernv: Avoid checkstop on HMI and MCE

From: Michael Neuling <hidden>
Date: 2017-10-25 10:59:42

On Wed, 2017-10-25 at 12:16 +0200, Michael Ellerman wrote:
Michael Neuling [off-list ref] writes:

quoted
On an unrecoverable HMI or MCE only generate a checkstop (via
PLATFORM ERROR opal reboot call) when panic_on_oops is set.

We currently generate a checkstop as an attempt for the FSP to grab a
dump and then reboot us. Unfortunately this never works and no one

Never? WT#.
Well, no one I've talked to, but I'm posting this so someone will stand up and
say they want it.
quoted
I've talked to has ever seen a resulting dump, let alone got useful
information from it.

Even worse, the checkstop gets in the way of debugging real
problems. If we hit a software bug that results in this, we get no
opportunity to debug it live. Similarly if the bug is due to hardware
that is not in the dump (say PCI or NVLINK GPU), we get no information
in the dump about that hardware.

So let's remove it unless someone sets panic_on_oops.

Nick just rewrote pnv_platform_error_reboot(), so please talk to him to
make sure you're not stepping on each other.
OK, will do.
quoted
diff --git a/arch/powerpc/platforms/powernv/opal-hmi.c b/arch/powerpc/platforms/powernv/opal-hmi.c
index c9e1a4ff29..23780970d0 100644
--- a/arch/powerpc/platforms/powernv/opal-hmi.c
+++ b/arch/powerpc/platforms/powernv/opal-hmi.c
@@ -284,6 +285,11 @@ static void hmi_event_handler(struct work_struct *work)
 			print_hmi_event_info(hmi_evt);
 		}

+		if (!panic_on_oops) {
+			die("Unrecoverable HMI exception", NULL, SIGBUS);
+			return;

I don't think we should return.

Otherwise we risk persisting corrupt data to disk and so on.
ok
If we're getting unrecoverable HMI/MCEs that are not actually indicative
of something bad happening then we need to filter those out somewhere.
We hit this with some new HMIs for NVLINK and the Vector Load one, so we need to
handle them, and we have code that does (or is coming).

In the meantime, it's very hard to debug them once we xstop.

Mikey
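
For readers following the thread, the policy in the hunk as posted can be
sketched as a small user-space model. This is only an illustration under
stated assumptions: the enum and both function names below are hypothetical
and do not exist in the kernel, and the sketch models the posted hunk only,
not whatever follow-up results from the review comment about the early return.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the two outcomes discussed in the thread. */
enum hmi_action {
	HMI_ACTION_CHECKSTOP,	/* PLATFORM ERROR opal reboot call (checkstop) */
	HMI_ACTION_DIE		/* die("Unrecoverable HMI exception", ...) instead */
};

/*
 * Model of the posted hunk: on an unrecoverable HMI, only take the
 * checkstop path when panic_on_oops is set; otherwise report via die()
 * so the machine can still be debugged live.
 */
static enum hmi_action unrecoverable_hmi_action(bool panic_on_oops)
{
	if (!panic_on_oops)
		return HMI_ACTION_DIE;
	return HMI_ACTION_CHECKSTOP;
}

/* Human-readable name for an action, for logging in the sketch. */
static const char *hmi_action_name(enum hmi_action a)
{
	switch (a) {
	case HMI_ACTION_CHECKSTOP:
		return "checkstop";
	case HMI_ACTION_DIE:
		return "die";
	}
	assert(!"unknown hmi_action");
	return "unknown";
}
```

Calling unrecoverable_hmi_action(false) selects the die() path and
unrecoverable_hmi_action(true) selects the checkstop path. The review concern
above is about what happens after the die() path, which this model
deliberately leaves out.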