Thread: 23 messages, 5 authors, 2023-12-13

Re: [PATCH v4 10/12] KVM: x86: never write to memory from kvm_vcpu_check_block()

From: Sean Christopherson <seanjc@google.com>
Date: 2023-12-13 22:59:22
Also in: kvm, kvm-riscv, linux-arm-kernel, linux-mips, linux-riscv, lkml

On Thu, Dec 14, 2023, Maxim Levitsky wrote:
> On Tue, 2023-12-12 at 07:28 -0800, Sean Christopherson wrote:
> > On Sun, Dec 10, 2023, Jim Mattson wrote:
> > > On Thu, Dec 7, 2023 at 8:21 AM Sean Christopherson [off-list ref] wrote:
> > > > Doh.  We got the less obvious cases and missed the obvious one.
> > > >
> > > > Ugh, and we also missed a related mess in kvm_guest_apic_has_interrupt().  That
> > > > thing should really be folded into vmx_has_nested_events().
> > > >
> > > > Good gravy.  And vmx_interrupt_blocked() does the wrong thing because that
> > > > specifically checks if L1 interrupts are blocked.
> > > >
> > > > Compile tested only, and definitely needs to be chunked into multiple patches,
> > > > but I think something like this mess?
> > >
> > > The proposed patch does not fix the problem. In fact, it messes things
> > > up so much that I don't get any test results back.
> >
> > Drat.
> >
> > > Google has an internal K-U-T test that demonstrates the problem. I
> > > will post it soon.
> >
> > Received, I'll dig in soonish, though "soonish" unfortunately might mean
> > 2024.
>
> Hi,
>
> So this is what I think:
>
> KVM does have kvm_guest_apic_has_interrupt() for this exact purpose,
> to check if nested APICv has a pending interrupt before halting.

For all intents and purposes, so was nested_ops->has_events().  I don't see
any reason to have two APIs that do the same thing, and the call to
kvm_guest_apic_has_interrupt() is wrong in that it doesn't verify that IRQs are
enabled for _L2_.  That's why my preference is to fold the two together.

> However the problem is bigger - with APICv we have in essence 2 pending
> interrupt bitmaps - the PIR and the IRR, and to know if the guest has a
> pending interrupt one has in theory to copy PIR to IRR, then see if the max
> is larger than the current PPR.

Yeah, this is what my untested hack-a-patch tried to do.
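
For illustration, the "merge PIR into IRR, then compare against PPR" check described above can be done as a read-only pass over both bitmaps. The following is a minimal user-space sketch, not KVM code: the names max_pending_vector() and has_deliverable_interrupt() and the flat 8 x 32-bit bitmap layout are assumptions made for this example (a real virtual-APIC page spaces its IRR registers 16 bytes apart), and only the priority class in bits 7:4 is compared.

```c
#include <stdbool.h>
#include <stdint.h>

#define APIC_VECTORS 256
#define APIC_WORDS   (APIC_VECTORS / 32)

/*
 * Hypothetical helper: return the highest vector set in (pir | irr),
 * or -1 if neither bitmap has a bit set.  Both bitmaps are only read;
 * nothing is copied from PIR into IRR.
 */
static int max_pending_vector(const uint32_t pir[APIC_WORDS],
                              const uint32_t irr[APIC_WORDS])
{
    for (int w = APIC_WORDS - 1; w >= 0; w--) {
        uint32_t bits = pir[w] | irr[w];

        if (!bits)
            continue;

        /* Scan from the most significant bit of this word downwards. */
        for (int bit = 31; bit >= 0; bit--) {
            if (bits & (UINT32_C(1) << bit))
                return w * 32 + bit;
        }
    }
    return -1;
}

/*
 * Hypothetical helper: an interrupt counts as deliverable when the
 * priority class (bits 7:4) of the highest pending vector is strictly
 * greater than the priority class of the PPR.
 */
static bool has_deliverable_interrupt(const uint32_t pir[APIC_WORDS],
                                      const uint32_t irr[APIC_WORDS],
                                      uint8_t ppr)
{
    int vec = max_pending_vector(pir, irr);

    return vec >= 0 && (vec & 0xf0) > (ppr & 0xf0);
}
```

Because the sketch takes both bitmaps as const, it can be called any number of times without changing what a later, real PIR-to-IRR sync would observe.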

> Since we don't want to write to guest memory,

The changelog is misleading/wrong.  Writing guest memory is ok, what isn't safe
is blocking or sleeping, i.e. KVM must not trigger a host page fault due to
accessing a page that's been swapped out.  Read vs. write doesn't matter.

So KVM can safely read and write guest memory so long as it is already mapped by
kvm_vcpu_map() (or I suppose if we wrapped an access with pagefault_disable(),
but I can't think of a sane reason to do that).  E.g. nVMX can access a vCPU's
PID mapping, but synthesizing a nested VM-Exit will cause explosions on nSVM.

> and the IRR here resides in the guest memory, I guess we have to do a
> 'dry-run' version of 'vmx_complete_nested_posted_interrupt' and call it from
> kvm_guest_apic_has_interrupt().

nested_ops->has_events() is the much better fit, e.g. the naming won't get weird
and we can gate the whole thing on is_guest_mode().  Though we probably need a
wrapper to handle any commonalities between nVMX and nSVM.

> What do you think? I can prepare a patch for this.

As above, this is what I tried to do, sort of.  Though it's obviously broken.  We
don't need a full dry-run because KVM only needs to detect events that are unique
to L2, e.g. nVMX's preemption timer, MTF, and pending virtual interrupts (hmm,
I suspect nSVM's vNMI is broken too).  Things like INIT and SMI don't require
nested virtualization awareness because the event itself is tracked for the vCPU
as a whole.
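
To make the split above concrete, here is a rough user-space sketch of a has_events-style check that is gated on guest mode and only looks at events unique to L2. Everything in it is hypothetical: struct toy_vcpu, its fields, and both function names are invented for this example and do not correspond to KVM's actual structures or to nested_ops->has_events().

```c
#include <stdbool.h>

/* Hypothetical, simplified vCPU state for illustration only. */
struct toy_vcpu {
    bool guest_mode;               /* vCPU is currently running L2 */

    /* Events that exist only while L2 is active. */
    bool preemption_timer_expired;
    bool mtf_pending;
    bool virtual_irq_pending;

    /* Events tracked for the vCPU as a whole. */
    bool init_pending;
    bool smi_pending;
};

/* Nested-only check: reports nothing unless the vCPU is in guest mode. */
static bool toy_nested_has_events(const struct toy_vcpu *vcpu)
{
    if (!vcpu->guest_mode)
        return false;

    return vcpu->preemption_timer_expired ||
           vcpu->mtf_pending ||
           vcpu->virtual_irq_pending;
}

/*
 * Common wrapper: vCPU-wide events need no nested awareness, so they
 * are checked directly and the L2-specific check is consulted last.
 */
static bool toy_vcpu_has_events(const struct toy_vcpu *vcpu)
{
    return vcpu->init_pending ||
           vcpu->smi_pending ||
           toy_nested_has_events(vcpu);
}
```

The point of the sketch is only the layering: the nested check never needs to replay the full event-injection path, because anything not specific to L2 is already visible to the common code.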