Re: [PATCH 3/5] KVM: Add a common kvm_run flag to communicate an exit needs completion

From: Marc Zyngier <maz@kernel.org>
Date: 2025-01-13 19:38:12
Also in: kvm, kvmarm, linux-arm-kernel, lkml

On Mon, 13 Jan 2025 18:58:45 +0000,
Sean Christopherson [off-list ref] wrote:

On Mon, Jan 13, 2025, Marc Zyngier wrote:

On Mon, 13 Jan 2025 15:44:28 +0000,
Sean Christopherson [off-list ref] wrote:

On Sat, Jan 11, 2025, Marc Zyngier wrote:

On Sat, 11 Jan 2025 01:24:48 +0000,
Sean Christopherson [off-list ref] wrote:

Add a kvm_run flag, KVM_RUN_NEEDS_COMPLETION, to communicate to userspace
that KVM_RUN needs to be re-executed prior to save/restore in order to
complete the instruction/operation that triggered the userspace exit.

KVM's current approach of adding notes in the Documentation is beyond
brittle, e.g. there is at least one known case where a KVM developer added
a new userspace exit type, and then that same developer forgot to handle
completion when adding userspace support.

Is this going to fix anything? If they couldn't be bothered to read
the documentation, let alone update it, how is that going to be
improved by extra rules and regulations?

I don't see how someone ignoring the documented behaviour of a given
exit reason is, all of a sudden, going to have an epiphany and take a
*new* flag into account.

The idea is to reduce the probability of introducing bugs, in KVM or userspace,
every time KVM attaches a completion callback.  Yes, userspace would need to be
updated to handle KVM_RUN_NEEDS_COMPLETION, but once that flag is merged, neither
KVM's documentation nor userspace would ever need to be updated again.  And if
all architectures took an approach of handling completion via function callback,
I'm pretty sure we'd never need to manually update KVM itself either.

You are assuming that we need this completion, and I dispute this
assertion.

Ah, gotcha.

+The pending state of the operation for such exits is not preserved in state
+which is visible to userspace, thus userspace should ensure that the operation
+is completed before performing state save/restore, e.g. for live migration.
+Userspace can re-enter the guest with an unmasked signal pending or with the
+immediate_exit field set to complete pending operations without allowing any
+further instructions to be executed.
+
+Without KVM_CAP_NEEDS_COMPLETION, KVM_RUN_NEEDS_COMPLETION will never be set
+and userspace must assume that exits of type KVM_EXIT_IO, KVM_EXIT_MMIO,
+KVM_EXIT_OSI, KVM_EXIT_PAPR, KVM_EXIT_XEN, KVM_EXIT_EPR, KVM_EXIT_X86_RDMSR,
+KVM_EXIT_X86_WRMSR, and KVM_EXIT_HYPERCALL require completion.
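
A minimal userspace sketch of what the hunk above asks of a VMM, assuming a
vCPU file descriptor and its mmap'ed struct kvm_run already exist. The
function names are hypothetical. KVM_RUN_NEEDS_COMPLETION is only proposed
by this series, so its bit value is taken as a parameter here rather than
guessed. KVM_EXIT_PAPR_HCALL is the uapi header's name for the exit the
documentation calls KVM_EXIT_PAPR.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Does the last exit need KVM_RUN to be re-entered before save/restore?
 * With the (proposed) capability, trust the flag; without it, fall back
 * to the list of exit reasons given in the documentation hunk.
 * needs_completion_flag is an assumption: the caller supplies the bit.
 */
bool exit_needs_completion(const struct kvm_run *run, bool have_cap,
			   uint16_t needs_completion_flag)
{
	assert(run != NULL);

	if (have_cap)
		return (run->flags & needs_completion_flag) != 0;

	switch (run->exit_reason) {
	case KVM_EXIT_IO:
	case KVM_EXIT_MMIO:
	case KVM_EXIT_OSI:
	case KVM_EXIT_PAPR_HCALL:	/* "KVM_EXIT_PAPR" in the docs */
	case KVM_EXIT_XEN:
	case KVM_EXIT_EPR:
	case KVM_EXIT_X86_RDMSR:
	case KVM_EXIT_X86_WRMSR:
	case KVM_EXIT_HYPERCALL:
		return true;
	default:
		return false;
	}
}

/*
 * Re-enter KVM_RUN with immediate_exit set, so the pending operation is
 * completed without running further guest instructions.
 * Returns 0 if KVM_RUN bailed out with EINTR as expected, 1 if it
 * returned normally (a new exit the caller must handle), -1 on any
 * other error with errno preserved.
 */
int complete_pending_exit(int vcpu_fd, struct kvm_run *run)
{
	int ret, saved_errno;

	assert(run != NULL);

	run->immediate_exit = 1;
	ret = ioctl(vcpu_fd, KVM_RUN, 0);
	saved_errno = errno;
	run->immediate_exit = 0;

	if (ret < 0 && saved_errno == EINTR)
		return 0;
	if (ret < 0) {
		errno = saved_errno;
		return -1;
	}
	return 1;
}
```

A VMM following this sketch would call exit_needs_completion() after it has
serviced the exit and, if it returns true, call complete_pending_exit()
before reading vCPU state for migration.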

So once you advertise KVM_CAP_NEEDS_COMPLETION, the completion flag
must be present for all of these exits, right? And from what I can
tell, this capability is unconditionally advertised.

Yet, you don't amend arm64 to publish that flag. Not that I think this
causes any issue (even if you save the state at that point without
reentering the guest, it will still be consistent), but that
directly contradicts the documentation (isn't that ironic? ;-).

It does cause issues, I missed this code in kvm_arch_vcpu_ioctl_run():

	if (run->exit_reason == KVM_EXIT_MMIO) {
		ret = kvm_handle_mmio_return(vcpu);
		if (ret <= 0)
			return ret;
	}

That's satisfying a load from the guest forwarded to userspace.

And MMIO stores, no?  I.e. PC needs to be incremented on stores as well.

Yes, *after* the store has completed. If you replay the instruction,
the same store comes out.

If the VMM did a save of the guest at this stage, restored and resumed it,
*nothing* bad would happen, as PC still points to the instruction that got
forwarded. You'll see the same load again.

But replaying an MMIO store could cause all kinds of problems, and even MMIO
loads could theoretically be problematic, e.g. if there are side effects in the
device that trigger on access to a device register.

But that's the VMM's problem. If it has modified its own state and
doesn't return to the guest to complete the instruction, that's just
as bad as with loads, which *do* have side effects as well.

Overall, the guest state exposed by KVM is always correct, and
replaying the instruction is not going to change that. It is only if
the VMM is broken that things turn ugly *for the VMM itself*, and I
claim that no amount of added flags is going to help that.

	M.

-- 
Without deviation from the norm, progress is not possible.
