Re: [PATCH v5 09/44] perf/x86: Switch LVTPC to/from mediated PMI vector on guest load/put context
From: Sean Christopherson <seanjc@google.com>
Date: 2025-08-18 15:25:36
Also in:
kvm, kvm-riscv, kvmarm, linux-perf-users, linux-riscv, lkml, loongarch
On Mon, Aug 18, 2025, Peter Zijlstra wrote:
On Fri, Aug 15, 2025 at 08:55:25AM -0700, Sean Christopherson wrote:quoted
On Fri, Aug 15, 2025, Sean Christopherson wrote:quoted
On Fri, Aug 15, 2025, Peter Zijlstra wrote:quoted
quoted
diff --git a/kernel/events/core.c b/kernel/events/core.c index e1df3c3bfc0d..ad22b182762e 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c@@ -6408,6 +6408,8 @@ void perf_load_guest_context(unsigned long data) task_ctx_sched_out(cpuctx->task_ctx, NULL, EVENT_GUEST); } + arch_perf_load_guest_context(data);So I still don't understand why this ever needs to reach the generic code. x86 pmu driver and x86 kvm can surely sort this out inside of x86, no?It's definitely possible to handle this entirely within x86, I just don't love switching the LVTPC without the protection of perf_ctx_lock and perf_ctx_disable(). It's not a sticking point for me if you strongly prefer something like this:diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index 0e5048ae86fa..86b81c217b97 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c@@ -1319,7 +1319,9 @@ void kvm_mediated_pmu_load(struct kvm_vcpu *vcpu) lockdep_assert_irqs_disabled(); - perf_load_guest_context(kvm_lapic_get_reg(vcpu->arch.apic, APIC_LVTPC)); + perf_load_guest_context(); + + perf_load_guest_lvtpc(kvm_lapic_get_reg(vcpu->arch.apic, APIC_LVTPC));Hmm, an argument for providing a dedicated perf_load_guest_lvtpc() APIs is that it would allow KVM to handle LVTPC writes in KVM's VM-Exit fastpath, i.e. without having to do a full put+reload of the guest context. So if we're confident that switching the host LVTPC outside of perf_{load,put}_guest_context() is functionally safe, I'm a-ok with it.Let me see. So the hardware sets Masked when it raises the interrupt. The interrupt handler clears it from software -- depending on uarch in 3 different places: 1) right at the start of the PMI 2) in the middle, right before enabling the PMU (writing global control) 3) at the end of the PMI the various changelogs adding that code mention spurious PMIs and malformed PEBS records. So the fun all happens when the guest is doing PMI and gets a VM-exit while still Masked. At that point, we can come in and completely rewrite the PMU state, reroute the PMI and enable things again. Then later, we 'restore' the PMU state, re-set LVTPC masked to the guest interrupt and 'resume'. What could possibly go wrong :/ Kan, I'm assuming, but not knowing, that writing all the PMU MSRs is somehow serializing state sufficient to not cause the above mentioned fails? Specifically, clearing PEBS_ENABLE should inhibit those malformed PEBS records or something? What if the host also has PEBS and we don't actually clear the bit? The current order ensures we rewrite LVTPC when global control is unset; I think we want to keep that.
Yes, for sure.
While staring at this, I note that perf_load_guest_context() will clear global ctrl, clear all the counter programming, and re-enable an empty pmu. Now, an empty PMU should result in global control being zero -- there is nothing run after all. But then kvm_mediated_pmu_load() writes an explicit 0 again. Perhaps replace this with asserting it is 0 instead?
Yeah, I like that idea, a lot. This? perf_load_guest_context(); /* * Sanity check that "loading" guest context disabled all counters, as * modifying the LVTPC while host perf is active will cause explosions, * as will loading event selectors and PMCs with guest values. * * VMX will enable/disable counters at VM-Enter/VM-Exit by atomically * loading PERF_GLOBAL_CONTROL. SVM effectively performs the switch by * configuring all events to be GUEST_ONLY. */ WARN_ON_ONCE(rdmsrq(kvm_pmu_ops.PERF_GLOBAL_CTRL)); perf_load_guest_lvtpc(kvm_lapic_get_reg(vcpu->arch.apic, APIC_LVTPC));
Anyway, this means that moving the LVTPC writing into kvm_mediated_pmu_load() as you suggest is identical. perf_load_guest_context() results in global control being 0, we then assert it is 0, and write LVTPC while it is still 0. kvm_pmu_load_guest_pmcs() will then frob the MSRs. OK, so *IF* doing the VM-exit during PMI is sound, this is something that needs a comment somewhere.
I'm a bit lost here. Are you essentially asking if it's ok to take a VM-Exit while the guest is handling a PMI? If so, that _has_ to work, because there are myriad things that can/will trigger a VM-Exit at any point while the guest is active.
Then going back again, is the easy part, since on the host side, we can never transition into KVM during a PMI.