RE: [PATCH][RFC v3] x86, hotplug: Use hlt instead of mwait if invoked from... | linux-pm

RE: [PATCH][RFC v3] x86, hotplug: Use hlt instead of mwait if invoked from disable_nonboot_cpus

From: "Chen, Yu C" <yu.c.chen@intel.com>
Date: 2016-07-07 02:50:23
Also in: lkml

-----Original Message-----
From: Rafael J. Wysocki [mailto:rjw@rjwysocki.net]
Sent: Thursday, July 07, 2016 8:33 AM
To: Chen, Yu C; James Morse
Cc: linux-pm@vger.kernel.org; Thomas Gleixner; H. Peter Anvin; Pavel Machek;
Borislav Petkov; Peter Zijlstra; Ingo Molnar; Len Brown; x86@kernel.org; linux-
kernel@vger.kernel.org
Subject: Re: [PATCH][RFC v3] x86, hotplug: Use hlt instead of mwait if invoked
from disable_nonboot_cpus

On Tuesday, June 28, 2016 05:16:43 PM Chen Yu wrote:

quoted

Stress test from Varun Koyyalagunta reports that, the nonboot CPU
would hang occasionally, when resuming from hibernation. Further
investigation shows that, the precise stage when nonboot CPU hangs, is
the time when the nonboot CPU been woken up incorrectly, and tries to
monitor the mwait_ptr for the second time, then an exception is
triggered due to illegal vaddr access, say, something like, 'Unable to
handler kernel address of 0xffff8800ba800010...'

Further investigation shows that, this exception is caused by
accessing a page without PRESENT flag, because the pte entry for this
vaddr is zero. Here's the scenario how this problem
happens: Page table for direct mapping is allocated dynamically by
kernel_physical_mapping_init, it is possible that in the resume
process, when the boot CPU is trying to write back pages to their
original address, and just right to writes to the monitor mwait_ptr
then wakes up one of the nonboot CPUs, since the page table currently
used by the nonboot CPU might not the same as it is before the
hibernation, an exception might occur due to inconsistent page table.

First try is to get rid of this problem by changing the monitor
address from task.flag to zero page, because no one would write data
to zero page. But there is still problem because of a ping-pong wake
up scenario in mwait_play_dead:

One possible implementation of a clflush is a read-invalidate snoop,
which is what a store might look like, so cflush might break the mwait.

1. CPU1 wait at zero page
2. CPU2 cflush zero page, wake CPU1 up, then CPU2 waits at zero page
3. CPU1 is woken up, and invoke cflush zero page, thus wake up CPU2 again.
then the nonboot CPUs never sleep for long.

So it's better to monitor different address for each nonboot CPUs,
however since there is only one zero page, at most:
PAGE_SIZE/L1_CACHE_LINE CPUs are satisfied, which is usually 64 on a
x86_64, apparently it's not enough for servers, maybe more zero pages
are required.

So choose a new solution as Brian suggested, to put the nonboot CPUs
into hlt before resume, without touching any memory during s/r.
Theoretically there might still be some problems if some of the CPUs
have already been put offline, but since the case is very rare and
users can work around it, we do not deal with this special case in
kernel for now.

BTW, as James mentioned, he might want to encapsulate
disable_nonboot_cpus into arch-specific, so this patch might need small

change after that.

quoted

Comments and suggestions would be appreciated.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=106371
Reported-and-tested-by: Varun Koyyalagunta <redacted>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>

Below is my sort of version of this (untested) and I did it this way, because the
issue is specific to resume from hibernation (the workaround need not be
applied anywhere else) and the hibernate_resume_nonboot_cpu_disable()
thing may be useful to arm64 too if I'm not mistaken (James?).

James might want a flag to distinguish whether it is from suspend or resume,
in his arch-specific disabled_nonboot_cpus?

and this patch works on my xeon.
Tested-by: Chen Yu <yu.c.chen@intel.com>

Actually, if arm64 uses it too, the __weak implementation can be dropped,
because it will be possible to make it depend on ARCH_HIBERNATION_HEADER
(x86 and arm64 are the only users of that).

Thanks,
Rafael

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help