[PATCH v2] arm/arm64: KVM: Properly account for guest CPU time

From: Christoffer Dall <hidden>
Date: 2015-06-08 11:35:24
Also in: kvm, kvmarm

On Fri, Jun 05, 2015 at 05:24:07AM -0700, Mario Smarduch wrote:

On 06/02/2015 02:27 AM, Christoffer Dall wrote:

quoted

On Mon, Jun 01, 2015 at 08:48:22AM -0700, Mario Smarduch wrote:

quoted

On 05/30/2015 11:59 PM, Christoffer Dall wrote:

quoted

Hi Mario,

On Fri, May 29, 2015 at 03:34:47PM -0700, Mario Smarduch wrote:

quoted

On 05/28/2015 11:49 AM, Christoffer Dall wrote:

quoted

Until now we have been calling kvm_guest_exit after re-enabling
interrupts when we come back from the guest, but this has the
unfortunate effect that CPU time accounting done in the context of timer
interrupts occurring while the guest is running doesn't properly notice
that the time since the last tick was spent in the guest.

Inspired by the comment in the x86 code, move the kvm_guest_exit() call
below the local_irq_enable() call and change __kvm_guest_exit() to
kvm_guest_exit(), because we are now calling this function with
interrupts enabled.  We have to now explicitly disable preemption and
not enable preemption before we've called kvm_guest_exit(), since
otherwise we could be preempted and everything happening before we
eventually get scheduled again would be accounted for as guest time.

At the same time, move the trace_kvm_exit() call outside of the atomic
section, since there is no reason for us to do that with interrupts
disabled.

Signed-off-by: Christoffer Dall <redacted>
---
This patch is based on kvm/queue, because it has the kvm_guest_enter/exit
rework recently posted by Christian Borntraeger.  I hope I got the logic
of this right, there were 2 slightly worrying facts about this:

First, we now enable and disable and enable interrupts on each exit
path, but I couldn't see any performance overhead on hackbench - yes the
only benchmark we care about.

Second, looking at the ppc and mips code, they seem to also call
kvm_guest_exit() before enabling interrupts, so I don't understand how
guest CPU time accounting works on those architectures.

Changes since v1:
 - Tweak comment and commit text based on Marc's feedback.
 - Explicitly disable preemption and enable it only after kvm_guest_exit().

 arch/arm/kvm/arm.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index e41cb11..fe8028d 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c

@@ -532,6 +532,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
 		kvm_vgic_flush_hwstate(vcpu);
 		kvm_timer_flush_hwstate(vcpu);
 
+		preempt_disable();
 		local_irq_disable();
 
 		/*

@@ -544,6 +545,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
 
 		if (ret <= 0 || need_new_vmid_gen(vcpu->kvm)) {
 			local_irq_enable();
+			preempt_enable();
 			kvm_timer_sync_hwstate(vcpu);
 			kvm_vgic_sync_hwstate(vcpu);
 			continue;

@@ -559,8 +561,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
 		ret = kvm_call_hyp(__kvm_vcpu_run, vcpu);
 
 		vcpu->mode = OUTSIDE_GUEST_MODE;
-		__kvm_guest_exit();
-		trace_kvm_exit(kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
+		/*
+		 * Back from guest
+		 *************************************************************/
+
 		/*
 		 * We may have taken a host interrupt in HYP mode (ie
 		 * while executing the guest). This interrupt is still

@@ -574,8 +578,17 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
 		local_irq_enable();
 
 		/*
-		 * Back from guest
-		 *************************************************************/
+		 * We do local_irq_enable() before calling kvm_guest_exit() so
+		 * that if a timer interrupt hits while running the guest we
+		 * account that tick as being spent in the guest.  We enable
+		 * preemption after calling kvm_guest_exit() so that if we get
+		 * preempted we make sure ticks after that is not counted as
+		 * guest time.
+		 */
+		kvm_guest_exit();
+		trace_kvm_exit(kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
+		preempt_enable();
+
 
 		kvm_timer_sync_hwstate(vcpu);
 		kvm_vgic_sync_hwstate(vcpu);

Hi Christoffer,
 so currently we take a snap shot when we enter the guest
(tsk->vtime_snap) and upon exit add the time we spent in
the guest and update accrued time, which appears correct.

not on arm64, because we don't select HAVE_VIRT_CPU_ACCOUNTING_GEN.  Or
am I missing something obvious here?

I see what you mean we can't use cycle based accounting to accrue
Guest time.

See other thread, we can enable this in the config but it still only
works with NO_HZ_FULL.

quoted

With this patch it appears that interrupts running
in host mode are accrued to Guest time, and additional preemption
latency is added.

It is true that interrupt processing in host mode (if they hit on a CPU
when it is running a guest) are accrued to guest time, but without this
patch on current arm64 we accrue no CPU time to guest time at all, which
is hardly more correct.

Yes if only ticks are supported then it's definitely better!

Nevertheless with high interrupt rate it will complicate root causing
issues, a lot of that time will go to guest.

That sounds like a separate fix to me; if there's an existing mechanism
to account for time spent on hw IRQs and it is somehow invalidated by
the PF_VCPU flag being set, then that feels wrong, at least how arm64
works, but it doesn't make this patch less correct.

Tracing through the code (account_system_time()) it appears if the
timer fires while an IRQ runs, tick are accounted to host IRQ
mode (CPUTIME_IRQ). Otherwise it's accrued to guest. Under heavy
interrupt load it's likely guest will mis some ticks, it appears
it's the reverse of what I initially thought but in practice
guest time should be ok as far as ticks go.

quoted

If this patch is incorrect, then how does it work on x86, where
handle_external_intr() is called (with a barrier in between) before
kvm_guest_exit(), and where handle_external_intr() is simply
local_irq_enable() on SVM and something more complicated on VMX ?

Finally, how exactly is preemption latency added here?   Won't IRQ
processing run with higher priority than any task on your system, so the
order of (1) process pending IRQs (2) call schedule if needed is still
preserved here, but we call kvm_guest_exit() between (1) and (2) instead
of before (1).

I may be missing something, but on return from interrupt with preempt
disabled we can't take the need resched path. And need to return
to KVM no?

preempt_enable() will call __preempt_schedule() and cause preemption
there, so you're talking about adding these lines of latency:t

	kvm_guest_exit();
	trace_kvm_exit(kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));

On return from IRQ this should execute - and el1_preempt won't
get called.

#ifdef CONFIG_PREEMPT
        get_thread_info tsk
        ldr     w24, [tsk, #TI_PREEMPT]         // get preempt count
        cbnz    w24, 1f                         // preempt count != 0
        ldr     x0, [tsk, #TI_FLAGS]            // get flags
        tbz     x0, #TIF_NEED_RESCHED, 1f       // needs rescheduling?
        bl      el1_preempt
1:
#endif

I understand that, but then you call preempt_enable right after which
calls __preempt_schedule() which has the same affect as that asm snippet
you pasted here.

quoted

And these were called with interrupts disabled before, so I don't see
the issue??

However, your question is making me think whether we have a race in the
current code on fully preemptible kernels, if we get preempted before
calling kvm_timer_sync_hwstate() and kvm_vgic_sync_hwstate(), then we
could potentially schedule another vcpu on this core and loose/corrupt
state, can we not?  We probably need to check for this in
kvm_vcpu_load/kvm_vcpu_put.  I need to think more about if this is a
real issue or if I'm seeing ghosts.

Yes appears like it could be an issue in PREEMPT mode.

see separate mail, I don't believe this to be an issue anymore.

quoted

It is entirely possible that I'm missing the mark here and everything
gets solved by enabling HAVE_VIRT_CPU_ACCOUNTING_GEN or we need some
extra logic?

I think something to look into for us, taking a low issue to high level
application - for state machine based type of applications (I guess like
NFV) it cause problems  to root cause issues, a lot of activities
run between ticks. For VM cycle based accounting is probably a must
in that case.

Would you run with NO_HZ_FULL in this case?  Because then we should just
enable HAVE_VIRT_CPU_ACCOUNTING_GEN, and I think that would be a good
start.

It may have a use case to run an isolated vCPU, but in general any mode
may be used (,NO_HZ, even low PERIODIC).

ok, but I still think it would be more correct to have this patch than
not to.

-Christoffer

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help