Re: [PATCH v5] x86/mce: Avoid infinite loop for copy from user recovery
From: Borislav Petkov <bp@alien8.de>
Date: 2021-02-02 11:02:20
Also in: linux-edac, lkml
On Mon, Feb 01, 2021 at 10:58:12AM -0800, Luck, Tony wrote:
> On Thu, Jan 28, 2021 at 06:57:35PM +0100, Borislav Petkov wrote:
> > Crazy idea: if you still can reproduce on -rc3, you could bisect: i.e.,
> > if you apply the patch on -rc3 and it explodes and if you apply the same
> > patch on -rc5 and it works, then that could be a start... Yeah, don't
> > have a better idea here. :-\
>
> I tried reproducing (applied the original patch I posted back to -rc3)
> and the same issue stubbornly refused to show up again.
>
> But I did hit something with the same signature (overflow bit set in
> bank 1) while running my futex test (which has two processes mapping
> the poison page). This time I *do* understand what happened. The test
> failed when the two processes were running on the two hyperthreads of
> the same core. Seeing overflow in this case is understandable because
> bank 1 MSRs on my test machine are shared between the HT threads. When
> I run the test again using taskset(1) to only allow running on thread 0
> of each core, it keeps going for hundreds of iterations.
>
> I'm not sure I can stitch together how this overflow also happened for
> my single process test. Maybe a migration from one HT thread to the
> other at an awkward moment?

Sounds plausible.
And the much more important question is, what is the code supposed to
do when that overflow *actually* happens in real life? Because IINM,
an overflow condition on the same page would mean killing the task to
contain the error and not killing the machine...
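
To make the question concrete, here is a rough sketch of the kind of
decision I mean. It is not the kernel's code. The bit positions are the
architectural IA32_MCi_STATUS ones; the enum, the function name and the
same_page argument are hypothetical, and it simply assumes the caller can
tell whether both errors hit the same page. Panicking on an overflow
across different pages is likewise only an assumed policy.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural IA32_MCi_STATUS bits. */
#define STATUS_VAL  (1ULL << 63)  /* register contents are valid */
#define STATUS_OVER (1ULL << 62)  /* a second error overwrote/was dropped */
#define STATUS_UC   (1ULL << 61)  /* uncorrected error */
#define STATUS_PCC  (1ULL << 57)  /* processor context corrupt */

/* Hypothetical outcome names, for illustration only. */
enum mce_action { MCE_IGNORE, MCE_KILL_TASK, MCE_PANIC };

/*
 * Hypothetical policy sketch: same_page is assumed to be known by the
 * caller (e.g. both hyperthreads consumed poison from the same page).
 */
static enum mce_action classify_overflow(uint64_t status, bool same_page)
{
	if (!(status & STATUS_VAL))
		return MCE_IGNORE;      /* nothing logged in this bank */

	if (status & STATUS_PCC)
		return MCE_PANIC;       /* context corrupt: cannot contain */

	if ((status & STATUS_OVER) && !same_page)
		return MCE_PANIC;       /* assumed: lost error is unknown */

	/* Overflow on the same page, or no overflow: contain in the task. */
	return MCE_KILL_TASK;
}
```

With the shared bank 1 case above, a status of VAL|OVER|UC and
same_page == true would end in MCE_KILL_TASK rather than MCE_PANIC.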
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette