Re: [PATCH 2/3] x86/mce: Avoid infinite loop for copy from user recovery
From: "Luck, Tony" <tony.luck@intel.com>
Date: 2021-07-22 15:21:39
Also in:
lkml
On Thu, Jul 22, 2021 at 06:54:37AM -0700, Jue Wang wrote:
> This patch assumes the UC error consumed in the kernel is always the
> same UC. Yet it's possible that two UCs on different pages are consumed
> in a row. The patch below will panic on the 2nd MCE. How can we make
> the code work with multiple UC errors?
>> +	int count = ++current->mce_count;
>> +
>> +	/* First call, save all the details */
>> +	if (count == 1) {
>> +		current->mce_addr = m->addr;
>> +		current->mce_kflags = m->kflags;
>> +		current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
>> +		current->mce_whole_page = whole_page(m);
>> +		current->mce_kill_me.func = func;
>> +	}
>> ......
>> +	/* Second or later call, make sure page address matches the one from first call */
>> +	if (count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
>> +		mce_panic("Machine checks to different user pages", m, msg);
The issue is getting the information about the location of the error
from the machine check handler to the "task_work" function that
processes it. Currently there is a single place to store the address
of the error in the task structure:

	current->mce_addr = m->addr;

Plausibly that could be made into an array, indexed by
current->mce_count, to save multiple addresses (mce_kflags, mce_ripv,
etc. would perhaps need to be arrays too).

But I don't want to pre-emptively make such a change without some data
showing that situations with multiple errors to different addresses:

1) actually occur, and
2) would be recovered if we made the change.

The first would be indicated by seeing the "Machine checks to different
user pages" panic. To confirm the second, you'd have to code up the
change to use arrays and check that it fixes the problem.

-Tony
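The array idea above can be sketched as a small user-space model. This is not kernel code: struct task_model, struct mce_record, record_mce(), process_mce_records() and MCE_MAX_RECORDS are all hypothetical names, and PAGE_SHIFT is hard-coded to 12 as an assumption. One deliberate departure from "indexed by current->mce_count": the sketch keeps a separate count of distinct pages, so repeated hits on an already-recorded page (the loop the patch guards against) do not consume a slot. Only running out of slots falls back to the panic case.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical user-space model of per-task MCE bookkeeping; not kernel code. */
#define PAGE_SHIFT 12          /* assumed 4 KiB pages */
#define MCE_MAX_RECORDS 4      /* hypothetical array size */

struct mce_record {
	uint64_t addr;
	uint64_t kflags;
	bool ripv;
	bool whole_page;
};

struct task_model {
	int mce_count;                          /* machine checks seen so far */
	int nr_records;                         /* distinct pages recorded */
	struct mce_record mce[MCE_MAX_RECORDS];
};

enum mce_result { MCE_RECORDED, MCE_SAME_PAGE, MCE_WOULD_PANIC };

/* Stands in for the handler: called once per machine check. */
enum mce_result record_mce(struct task_model *t, uint64_t addr,
			   uint64_t kflags, bool ripv, bool whole_page)
{
	t->mce_count++;

	/* Repeat hit on a page we already saved: nothing new to record. */
	for (int i = 0; i < t->nr_records; i++)
		if ((t->mce[i].addr >> PAGE_SHIFT) == (addr >> PAGE_SHIFT))
			return MCE_SAME_PAGE;

	/* No free slot: this is where the kernel would still have to panic. */
	if (t->nr_records == MCE_MAX_RECORDS)
		return MCE_WOULD_PANIC;

	t->mce[t->nr_records++] = (struct mce_record){
		.addr = addr, .kflags = kflags,
		.ripv = ripv, .whole_page = whole_page,
	};
	return MCE_RECORDED;
}

/* Stands in for the task_work callback: visit every saved page, then reset. */
int process_mce_records(struct task_model *t)
{
	int handled = 0;

	for (int i = 0; i < t->nr_records; i++) {
		printf("recover page 0x%llx\n",
		       (unsigned long long)(t->mce[i].addr >> PAGE_SHIFT));
		handled++;
	}
	t->nr_records = 0;
	t->mce_count = 0;
	return handled;
}
```

With this model, a UC at 0x1000040 followed by another at 0x1000080 (same page) and then one at 0x2000000 yields two records instead of a "different user pages" panic; only a fifth distinct page would hit the fallback. Whether that matches real-world error patterns is exactly the data question raised above.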