Thread (22 messages), 3 authors, 2021-09-20

Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery

From: Borislav Petkov <bp@alien8.de>
Date: 2021-09-13 09:25:12
Also in: linux-edac, lkml

On Tue, Aug 17, 2021 at 05:29:40PM -0700, Tony Luck wrote:
Recovery action when get_user() triggers a machine check uses the fixup
path to make get_user() return -EFAULT.  Also queue_task_work() sets up
so that kill_me_maybe() will be called on return to user mode to send
a SIGBUS to the current process.

But there are places in the kernel where the code assumes that this
EFAULT return was simply because of a page fault. The code takes some
action to fix that, and then retries the access. This results in a second
machine check.

While processing this second machine check queue_task_work() is called
again. But since this uses the same callback_head structure that was used
in the first call, the net result is an entry on the current->task_works
list that points to itself. When task_work_run() is called it loops
forever in this code:

        do {
                next = work->next;
                work->func(work);
                work = next;
                cond_resched();
        } while (work);
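
The self-referencing entry can be reproduced in a small user-space model. This is only a sketch: `struct cb_node`, `push()` and `walk()` are hypothetical stand-ins, and it assumes entries are added by pushing onto the head of a singly linked list, without the atomic operations the kernel would need.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's callback_head structure. */
struct cb_node {
    struct cb_node *next;
};

/*
 * Simplified, single-threaded push onto the head of the list.
 * Assumption: the real list is also a head-insert list, but it is
 * updated atomically, which this model leaves out.
 */
static void push(struct cb_node **head, struct cb_node *node)
{
    assert(node != NULL);
    node->next = *head;   /* on a second push of the same node, *head == node */
    *head = node;
}

/* Follow ->next for at most `limit` steps and return how many entries were seen. */
static int walk(const struct cb_node *head, int limit)
{
    int seen = 0;

    while (head != NULL && seen < limit) {
        head = head->next;
        seen++;
    }
    return seen;
}
```

Pushing one node once gives a list of length one. Pushing the same node a second time sets `node->next = node`, so `walk()` only stops because of its cap, which mirrors the `while (work)` loop above never seeing a NULL.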

Add a counter (current->mce_count) to keep track of repeated machine
checks before task_work() is called. First machine check saves the address
information and calls task_work_add(). Subsequent machine checks before
that task_work callback is executed check that the address is in the
same page as the first machine check (since the callback will offline
exactly one page).

Expected worst case is two machine checks before moving on (e.g. one user
access with page faults disabled, then a repeat to the same address with
page faults enabled). Just in case there is some code that loops forever
enforce a limit of 10.
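
The counting scheme described above can be modelled in user space roughly as follows. This is a sketch under stated assumptions, not the patch itself: `struct task_model`, `model_queue_task_work()` and the verdict names are hypothetical, a 4 KiB page size is assumed, and since the description does not say what happens on a page mismatch or when the limit is exceeded, the model just returns distinct codes. Resetting the counter when the callback runs is also an assumption, inferred from "before task_work() is called".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12   /* assumption: 4 KiB pages */
#define MODEL_MCE_LIMIT  10   /* the limit named in the description */

enum mce_verdict {
    MCE_QUEUED,          /* first machine check: address saved, callback queued */
    MCE_ALREADY_QUEUED,  /* repeat within the same page: nothing queued again */
    MCE_DIFFERENT_PAGE,  /* repeat that hit another page */
    MCE_TOO_MANY         /* more than MODEL_MCE_LIMIT before the callback ran */
};

/* Hypothetical per-task state; mce_count mirrors current->mce_count. */
struct task_model {
    int      mce_count;
    uint64_t mce_addr;
    int      times_queued;   /* how often the callback was actually queued */
};

static enum mce_verdict model_queue_task_work(struct task_model *t, uint64_t addr)
{
    int count;

    assert(t != NULL);
    count = ++t->mce_count;

    if (count == 1) {
        /* First machine check: remember the address and queue exactly once. */
        t->mce_addr = addr;
        t->times_queued++;
        return MCE_QUEUED;
    }
    if (count > MODEL_MCE_LIMIT)
        return MCE_TOO_MANY;
    if ((t->mce_addr >> MODEL_PAGE_SHIFT) != (addr >> MODEL_PAGE_SHIFT))
        return MCE_DIFFERENT_PAGE;
    return MCE_ALREADY_QUEUED;
}

/* Assumption: the count starts over once the queued callback has run. */
static void model_callback_ran(struct task_model *t)
{
    assert(t != NULL);
    t->mce_count = 0;
}
```

The point of the model is that the list is touched only when `count == 1`, so a retry after -EFAULT cannot link the same entry onto the list a second time.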

Cc: <redacted>

What about a Fixes: tag?

I guess backporting this to the respective kernels is predicated upon
the existence of those other "places" in the kernel where code assumes
the EFAULT was because of a #PF.

Hmmm?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette