Thread: 14 messages, 3 authors, 2021-01-08

Re: [PATCH RFC x86/mce] Make mce_timed_out() identify holdout CPUs

From: "Paul E. McKenney" <paulmck@kernel.org>
Date: 2021-01-07 00:42:47
Also in: lkml
Subsystem: the rest, x86 architecture (32-bit and 64-bit), x86 mce infrastructure · Maintainers: Linus Torvalds, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, Tony Luck

On Thu, Jan 07, 2021 at 12:26:19AM +0000, Luck, Tony wrote:
> > Please see below for an updated patch.
>
> Yes. That worked:
>
> [   78.946069] mce: mce_timed_out: MCE holdout CPUs (may include false positives): 24-47,120-143
> [   78.946151] mce: mce_timed_out: MCE holdout CPUs (may include false positives): 24-47,120-143
> [   78.946153] Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
>
> I guess that more than one CPU hit the timeout and so your new message was printed twice
> before the panic code took over?

Could well be.

It would be easy to add a flag that allowed only one CPU to print the
message.  Does that make sense?  (See off-the-cuff probably-broken
delta patch below for one approach.)

> Once again, the whole of socket 1 is MIA rather than just the pair of threads on one of the cores there.
> But that's a useful improvement (eliminating the other three sockets on this system).
>
> Tested-by: Tony Luck <tony.luck@intel.com>

Thank you very much!  I will apply this.

							Thanx, Paul
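
The "only one CPU prints" flag discussed above can be sketched in
userspace C11. This is an illustration only, not the kernel code: the
names report_gate, claim_report() and rearm_report() are hypothetical,
and C11 atomic_exchange() stands in for the kernel's atomic_xchg().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for mce_missing_cpus_gate. */
static atomic_int report_gate;

/*
 * Return true for exactly one caller until the gate is re-armed.
 * atomic_exchange() stores 1 and returns the previous value, so only
 * the caller that observed 0 wins; every later caller sees 1.
 */
static bool claim_report(void)
{
	return atomic_exchange(&report_gate, 1) == 0;
}

/* Re-arm the gate so that the next timeout can be reported again. */
static void rearm_report(void)
{
	atomic_store(&report_gate, 0);
}
```

A caller would guard the message with "if (claim_report())", and the
gate would be re-armed wherever the rendezvous state is reset, which is
what the atomic_set() calls in the delta patch below are meant to do.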

------------------------------------------------------------------------
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 7a6e1f3..b46ac56 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -882,6 +882,7 @@ static atomic_t mce_callin;
  */
 static cpumask_t mce_present_cpus;
 static cpumask_t mce_missing_cpus;
+static atomic_t mce_missing_cpus_gate;
 
 /*
  * Check if a timeout waiting for other CPUs happened.
@@ -900,7 +901,7 @@ static int mce_timed_out(u64 *t, const char *msg)
 	if (!mca_cfg.monarch_timeout)
 		goto out;
 	if ((s64)*t < SPINUNIT) {
-		if (mca_cfg.tolerant <= 1) {
+		if (mca_cfg.tolerant <= 1 && !atomic_xchg(&mce_missing_cpus_gate, 1)) {
 			if (cpumask_andnot(&mce_missing_cpus, cpu_online_mask, &mce_present_cpus))
 				pr_info("%s: MCE holdout CPUs (may include false positives): %*pbl\n",
 					__func__, cpumask_pr_args(&mce_missing_cpus));
@@ -1017,6 +1018,7 @@ static int mce_start(int *no_way_out)
 	 */
 	order = atomic_inc_return(&mce_callin);
 	cpumask_set_cpu(smp_processor_id(), &mce_present_cpus);
+	atomic_set(&mce_missing_cpus_gate, 0);
 
 	/*
 	 * Wait for everyone.
@@ -1126,6 +1128,7 @@ static int mce_end(int order)
 	atomic_set(&global_nwo, 0);
 	atomic_set(&mce_callin, 0);
 	cpumask_clear(&mce_present_cpus);
+	atomic_set(&mce_missing_cpus_gate, 0);
 	barrier();
 
 	/*
@@ -2725,6 +2728,7 @@ static void mce_reset(void)
 	atomic_set(&mce_callin, 0);
 	atomic_set(&global_nwo, 0);
 	cpumask_clear(&mce_present_cpus);
+	atomic_set(&mce_missing_cpus_gate, 0);
 }
 
 static int fake_panic_get(void *data, u64 *val)
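
The holdout report in mce_timed_out() computes "online AND NOT present"
and prints the result as a range list. A minimal userspace sketch of
both steps follows, assuming at most 64 CPUs so that a uint64_t can
stand in for cpumask_t. The helpers holdout_mask() and
format_cpu_list() are hypothetical names, and the formatter only
imitates the style of the kernel's %*pbl output.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Holdouts are CPUs that are online but never checked in. */
static uint64_t holdout_mask(uint64_t online, uint64_t present)
{
	return online & ~present;
}

/*
 * Write the set bits of mask into buf as a range list such as
 * "3-4,6-7". Stops at the first range that does not fit and leaves
 * buf NUL-terminated. Returns the number of characters written.
 */
static size_t format_cpu_list(uint64_t mask, char *buf, size_t len)
{
	size_t used = 0;
	int bit = 0;

	if (len)
		buf[0] = '\0';
	while (bit < 64) {
		int start, end, n;

		if (!((mask >> bit) & 1)) {
			bit++;
			continue;
		}
		/* Found the start of a run of set bits; find its end. */
		start = bit;
		while (bit < 64 && ((mask >> bit) & 1))
			bit++;
		end = bit - 1;
		if (used >= len)
			break;
		if (start == end)
			n = snprintf(buf + used, len - used, "%s%d",
				     used ? "," : "", start);
		else
			n = snprintf(buf + used, len - used, "%s%d-%d",
				     used ? "," : "", start, end);
		if (n < 0 || (size_t)n >= len - used) {
			buf[used] = '\0';	/* drop the partial range */
			break;
		}
		used += (size_t)n;
	}
	return used;
}
```

For example, with CPUs 0-47 online and only CPUs 0-23 present, the
holdout mask covers bits 24 through 47 and formats as "24-47", which
has the same shape as the first range in the log quoted above.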