RE: [PATCH] x86/hyper-v: Zero out the VP assist page to fix CPU offlining
From: Thomas Gleixner <hidden>
Date: 2019-07-18 07:01:06
Also in:
lkml
On Thu, 18 Jul 2019, Dexuan Cui wrote:
quoted hunk ↗ jump to hunk
quoted
On Thu, 4 Jul 2019, Dexuan Cui wrote: This is the allocation when the CPU is brought online for the first time. So what effect has zeroing at allocation time vs. offlining and potentially receiving IPIs? That allocation is never freed. Neither the comment nor the changelog make any sense to me. tglxThat allocation was introduced by the commit a46d15cc1ae5 ("x86/hyper-v: allocate and use Virtual Processor Assist Pages"). I think it's ok to not free the page when a CPU is offlined: every CPU uses only 1 page and CPU offlining is not really a very usual operation except for the scenario of hibernation (and suspend-to-memory), where the CPUs are quickly onlined again, when we resume from hibernation. IMO Vitaly intentionally decided to not free the page for simplicity of the code. When a CPU (e.g. CPU1) is being onlined, in hv_cpu_init(), we allocate the VP_ASSIST_PAGE page and enable the PV EOI optimization for this CPU by writing the MSR HV_X64_MSR_VP_ASSIST_PAGE. From now on, this CPU *always* uses hvp->apic_assist (which is updated by the hypervisor) to decide if it needs to write the EOI MSR: static void hv_apic_eoi_write(u32 reg, u32 val) { struct hv_vp_assist_page *hvp = hv_vp_assist_page[smp_processor_id()]; if (hvp && (xchg(&hvp->apic_assist, 0) & 0x1)) return; wrmsr(HV_X64_MSR_EOI, val, 0); } When a CPU (e.g. CPU1) is being offlined, on this CPU, we do: 1. in hv_cpu_die(), we disable the PV EOI optimizaton for this CPU; 2. we finish the remaining work of stopping this CPU; 3. this CPU is completed stopped. Between 1 and 3, this CPU can still receive interrupts (e.g. IPIs from CPU0, and Local APIC timer interrupts), and this CPU *must* write the EOI MSR for every interrupt received, otherwise the hypervisor may not deliver further interrupts, which may be needed to stop this CPU completely. So we need to make sure hvp->apic_assist.bit0 is zero, after we run the line "wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, 0);" in hv_cpu_die(). The easiest way is what I do in this patch. Alternatively, we can use the below patch:@@ -188,8 +188,12 @@ static int hv_cpu_die(unsigned int cpu) local_irq_restore(flags); free_page((unsigned long)input_pg); - if (hv_vp_assist_page && hv_vp_assist_page[cpu]) + if (hv_vp_assist_page && hv_vp_assist_page[cpu]) { + local_irq_save(flags); wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, 0); + hvp->apic_assist &= ~1; + local_irq_restore(flags); + } if (hv_reenlightenment_cb == NULL) return 0;This second version needs 3+ lines, so I prefer the one-line version. :-)
Those are two different things. The GPF_ZERO allocation makes sense on it's
own but it _cannot_ prevent the following scenario:
cpu_init()
if (!hvp)
hvp = vmalloc(...., GFP_ZERO);
...
hvp->apic_assist |= 1;
#1 cpu_die()
if (....)
wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, 0);
---> IPI
if (!(hvp->apic_assist & 1))
wrmsr(APIC_EOI); <- PATH not taken
#3 cpu is dead
cpu_init()
if (!hvp)
hvp = vmalloc(....m, GFP_ZERO); <- NOT TAKEN because hvp != NULL
So you have to come up with a better fairy tale why GFP_ZERO 'fixes' this.
Allocating hvp with GFP_ZERO makes sense on it's own so the allocated
memory has a defined state, but that's a different story.
The 3 liner patch above makes way more sense and you can spare the
local_irq_save/restore by moving the whole condition into the
irq_save/restore region above.
Thanks,
tglx