Re: [RFC PATCH 14/16] KVM: PPC: booke: category E.HV (GS-mode) support
From: Alexander Graf <hidden>
Date: 2012-01-10 03:11:56
Also in:
kvm
On 10.01.2012, at 01:51, Scott Wood wrote:
On 01/09/2012 11:46 AM, Alexander Graf wrote:quoted
=20 On 21.12.2011, at 02:34, Scott Wood wrote: =20quoted
Chips such as e500mc that implement category E.HV in Power ISA 2.06 provide hardware virtualization features, including a new MSR mode =
for
quoted
quoted
guest state. The guest OS can perform many operations without =
trapping
quoted
quoted
into the hypervisor, including transitions to and from guest =
userspace.
quoted
quoted
=20 Since we can use SRR1[GS] to reliably tell whether an exception came =
from
quoted
quoted
guest state, instead of messing around with IVPR, we use DO_KVM =
similarly
quoted
quoted
to book3s.=20 Is there any benefit of using DO_KVM? I would assume that messing =
with IVPR is faster.
=20 Using the GS bit to decide which handler to run means we won't get confused if a machine check or critical interrupt happens between entering/exiting the guest and updating IVPR (we could use the IS bit similarly in PR-mode). =20 This could be supplemented with IVPR (though that will add a few =
cycles
to guest entry/exit) or some sort of runtime patching (would be more coarse-grained, active when any KVM guest exists) to avoid adding overhead to traps when KVM is not used, but I'd like to quantify that overhead first. It should be much lower than what happens on book3s.
Hrm. Yeah, given that your DO_KVM handler is so much simpler, it might = make sense to stick with that method. Benchmarks would be useful in the = long run though.
=20quoted
quoted
Current issues include: - Machine checks from guest state are not routed to the host =
handler.
quoted
quoted
- The guest can cause a host oops by executing an emulated =
instruction
quoted
quoted
in a page that lacks read permission. Existing e500/4xx support =
has
quoted
quoted
the same problem.=20 We solve that in book3s pr by doing =20 LAST_INST =3D <known bad value>; PACA->kvm_mode =3D <recover at next inst>; lwz(guest pc); do_more_stuff(); =20 That way when an exception occurs at lwz() the DO_KVM handler checks =
that we're in kvm mode "recover" which does basically srr0+=3D4; rfi;.
=20 I was thinking we'd check ESR[EPID] or SRR1[IS] as appropriate, and treat it as a kernel fault (search exception table) -- but this works too and is a bit cleaner (could be other uses of external pid), at the expense of a couple extra instructions in the emulation path (but probably a slightly faster host TLB handler). =20 The check wouldn't go in DO_KVM, though, since on bookehv that only deals with diverting flow when xSRR1[GS] is set, which wouldn't be the case here.
Yup, not sure where you'd put the check, as it'd slow down normal = operation too. Hrm.
=20quoted
quoted
@@ -243,16 +324,20 @@ static int kvmppc_booke_irqprio_deliver(struct =
kvm_vcpu *vcpu,
quoted
quoted
case BOOKE_IRQPRIO_AP_UNAVAIL: case BOOKE_IRQPRIO_ALIGNMENT: allowed =3D 1; - msr_mask =3D MSR_CE|MSR_ME|MSR_DE; + msr_mask =3D MSR_GS | MSR_CE | MSR_ME | MSR_DE;=20 No need to do this. You already force MSR_GS in set_msr();=20 OK. This was here since before set_msr() started doing that. :-) =20quoted
quoted
+ if (!current->thread.kvm_vcpu) { + WARN(1, "no vcpu\n"); + return -EPERM; + }=20 Huh?=20 Oops, leftover debugging. =20quoted
quoted
+static int emulation_exit(struct kvm_run *run, struct kvm_vcpu =
*vcpu)
quoted
quoted
+{ + enum emulation_result er; + + er =3D kvmppc_emulate_instruction(run, vcpu); + switch (er) { + case EMULATE_DONE: + /* don't overwrite subtypes, just account kvm_stats */ + kvmppc_account_exit_stat(vcpu, EMULATED_INST_EXITS); + /* Future optimization: only reload non-volatiles if + * they were actually modified by emulation. */ + return RESUME_GUEST_NV; + + case EMULATE_DO_DCR: + run->exit_reason =3D KVM_EXIT_DCR; + return RESUME_HOST; + + case EMULATE_FAIL: + /* XXX Deliver Program interrupt to guest. */ + printk(KERN_CRIT "%s: emulation at %lx failed (%08x)\n", + __func__, vcpu->arch.regs.nip, =
vcpu->arch.last_inst);
quoted
=20 This should be throttled, otherwise the guest can spam our logs.=20 Yes it should, but I'm just moving the code here.
Yeah, only realized this later. Maybe next time (not for this patch set, = next time you're sending something) just extract these mechanical parts, = so it's easier to review the pieces where code actually changes :).
=20quoted
quoted
+ /* For debugging, encode the failing instruction and + * report it to userspace. */ + run->hw.hardware_exit_reason =3D ~0ULL << 32; + run->hw.hardware_exit_reason |=3D vcpu->arch.last_inst;=20 =20 I'm fairly sure you want to fix this :)=20 Likewise, that's what booke.c already does. What should it do =
instead?
This is what book3s does:
case EMULATE_FAIL:
printk(KERN_CRIT "%s: emulation at %lx failed =
(%08x)\n",
__func__, kvmppc_get_pc(vcpu), =
kvmppc_get_last_inst(vcpu));
kvmppc_core_queue_program(vcpu, flags);
r =3D RESUME_GUEST;
which also doesn't throttle the printk, but I think injecting a program =
fault into the guest is the most sensible thing to do if we don't know =
what the instruction is supposed to do. Best case we get an oops inside =
the guest telling us what broke :).
=20quoted
/**quoted
* kvmppc_handle_exit *@@ -374,12 +530,39 @@ out:int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu, unsigned int exit_nr) { - enum emulation_result er; int r =3D RESUME_HOST; =20 /* update before a new last_exit_type is rewritten */ kvmppc_update_timing_stats(vcpu); =20 + /* + * If we actually care, we could copy MSR, DEAR, and ESR to =
regs,
quoted
quoted
+ * insert an appropriate trap number, etc. + * + * Seems like a waste of cycles for something that should only =
matter
quoted
quoted
+ * to someone using sysrq-t/p or similar host kernel debug =
facility.
quoted
quoted
+ * We have other debug facilities to get that information from a + * guest through userspace. + */ + switch (exit_nr) { + case BOOKE_INTERRUPT_EXTERNAL: + do_IRQ(&vcpu->arch.regs);=20 Ah, so that's what you want to use regs for. So is having a pt_regs struct that only contains useful register values in half its fields any useful here? Or could we keep control of the registers ourselves, enabling us to maybe one day optimize things more.=20 I think it contains enough to be useful for debugging code such as =
sysrq
and tracers, and as noted in the comment we could copy the rest if we care enough. MSR might be worth copying. =20 It will eventually be used for machine checks as well, which I'd like =
to
hand reasonable register state to, at least for GPRs, LR, and PC. =20 If there's a good enough performance reason, we could just copy everything over for machine checks and pass NULL to do_IRQ (I think it can take this -- a dummy regs struct if not), but it seems premature =
at
the moment unless the switch already causes measured performance loss (cache utilization?).
I'm definitely not concerned about performance, but complexity and = uniqueness. With the pt_regs struct, we have a bunch of fields in the = vcpu that are there, but unused. I find that situation pretty confusing. So yes, I would definitely prefer to copy registers during MC and keep = the registers where they are today - unless there are SPRs for them of = course. Imagine we'd one day want to share GPRs with user space through the = kvm_run structure (see the s390 patches on the ML for this). I really = wouldn't want to make pt_regs part of our userspace ABI.
=20quoted
quoted
@@ -387,30 +570,56 @@ int kvmppc_handle_exit(struct kvm_run *run, =
struct kvm_vcpu *vcpu,
quoted
quoted
=20 switch (exit_nr) { case BOOKE_INTERRUPT_MACHINE_CHECK: - printk("MACHINE CHECK: %lx\n", mfspr(SPRN_MCSR)); - kvmppc_dump_vcpu(vcpu); - r =3D RESUME_HOST; + kvm_resched(vcpu); + r =3D RESUME_GUEST;=20 huh?=20 Patch shuffling accident -- this belongs with a later patch that =
invokes
the host machine check handler similar to what is done with do_IRQ(). The host machine check handler needs some work first, though. =20quoted
quoted
break; =20 case BOOKE_INTERRUPT_EXTERNAL: kvmppc_account_exit(vcpu, EXT_INTR_EXITS); - if (need_resched()) - cond_resched(); + kvm_resched(vcpu);=20 Why are we explicit about the resched? On book3s I just call =
kvm_resched(vcpu) before the switch().
=20 There are a few exit types where we don't currently do the resched -- =
if
they're all bugs or don't-cares, we could move it out of the switch. =20 We probably should defer the check until after we've disabled interrupts, similar to signals -- even if we didn't exit for an interrupt, we could have received one after enabling them.
Yup. I just don't think you can call resched() with interrupts disabled, = so a bit cleverness is probably required here.
=20quoted
quoted
+ if (kvm_is_visible_gfn(vcpu->kvm, gfn)) { + /* The guest TLB had a mapping, but the shadow =
TLB
quoted
quoted
+ * didn't. This could be because: + * a) the entry is mapping the host kernel, or + * b) the guest used a large mapping which we're =
faking
quoted
quoted
+ * Either way, we need to satisfy the fault =
without
quoted
quoted
+ * invoking the guest. */ + kvmppc_mmu_map(vcpu, eaddr, gpaddr, gtlb_index); + } else { + /* Guest mapped and leaped at non-RAM! */ + kvmppc_booke_queue_irqprio(vcpu, + =
BOOKE_IRQPRIO_MACHINE_CHECK);
quoted
=20 Are you sure? Couldn't this also be MMIO? That doesn't really improve =
the situation as executing from MMIO is tricky with the KVM model, but = it's not necessarily bad. Oh well, I guess we'll have to do something = and throwing an #MC isn't all that ugly.
=20 I think I asked you about executing from MMIO once, and you said it wasn't supported even in straight QEMU. Have things changed?
Yeah, I talked to Anthony about that part and apparently the QEMU design = does support execution from MMIO. But don't worry about it for now. I = don't think we'll really have guest OSs doing this. And if they do, we = can worry about it then.
=20quoted
quoted
diff --git a/arch/powerpc/kvm/booke.h b/arch/powerpc/kvm/booke.h index 05d1d99..d53bcf2 100644 --- a/arch/powerpc/kvm/booke.h +++ b/arch/powerpc/kvm/booke.h@@ -48,7 +48,20 @@#define BOOKE_IRQPRIO_PERFORMANCE_MONITOR 19 /* Internal pseudo-irqprio for level triggered externals */ #define BOOKE_IRQPRIO_EXTERNAL_LEVEL 20 -#define BOOKE_IRQPRIO_MAX 20 +#define BOOKE_IRQPRIO_DBELL 21 +#define BOOKE_IRQPRIO_DBELL_CRIT 22 +#define BOOKE_IRQPRIO_MAX 23=20 So was MAX wrong before or is it too big now?=20 MAX is just a marker for how many IRQPRIOs we have, not any sort of external limit. This patch adds new IRQPRIOs, so MAX goes up. =20 The actual limit is the number of bits in a long.
Yes, and before the highest value was 20 with MAX being 20, now the = highest value is 22 with MAX being 23. Either MAX =3D=3D highest number = or MAX =3D=3D highest number + 1, but you're changing the semantics of = MAX here. Maybe it was wrong before, I don't know, hence I'm asking :).
=20quoted
quoted
+ .if \flags & NEED_EMU + lwz r9, VCPU_KVM(r4)=20 writing r9 =20quoted
+ .endif + +#ifdef CONFIG_KVM_EXIT_TIMING + /* save exit time */ +1: mfspr r7, SPRN_TBRU + mfspr r8, SPRN_TBRL + mfspr r9, SPRN_TBRU=20 overwriting r9 again?=20 Oops. It's RFC for a reason. :-) =20quoted
quoted
+#ifndef CONFIG_64BIT=20 Double negation is always hard to read. Please reverse the ifdef :)=20 OK. =20quoted
quoted
+lightweight_exit: + PPC_STL r2, HOST_R2(r1) + + mfspr r3, SPRN_PID + stw r3, VCPU_HOST_PID(r4) + lwz r3, VCPU_GUEST_PID(r4) + mtspr SPRN_PID, r3 + + /* Save vcpu pointer for the exception handlers + * must be done before loading guest r2. + */ +// SET_VCPU(r4)=20 hm?=20 Can just be removed, it's handled in booke's vcpu load/put. =20quoted
quoted
+ lwz r6, (VCPU_SHARED_MAS2 + 4)(r11) +#else + ld r6, (VCPU_SHARED_MAS2)(r11) +#endif + lwz r7, VCPU_SHARED_MAS7_3+4(r11) + lwz r8, VCPU_SHARED_MAS4(r11) + mtspr SPRN_MAS0, r3 + mtspr SPRN_MAS1, r5 + mtspr SPRN_MAS2, r6 + mtspr SPRN_MAS3, r7 + mtspr SPRN_MAS4, r8 + lwz r3, VCPU_SHARED_MAS6(r11) + lwz r5, VCPU_SHARED_MAS7_3+0(r11) + mtspr SPRN_MAS6, r3 + mtspr SPRN_MAS7, r5 + /* Disable MAS register updates via exception */ + mfspr r3, SPRN_EPCR + oris r3, r3, SPRN_EPCR_DMIUH@h + mtspr SPRN_EPCR, r3=20 Shouldn't this happen before you set the MAS registers? :)=20 Yes (though we really shouldn't be getting a TLB miss here, at least =
on
e500mc).
Yeah, but the way it's now it gives you a false feeling of security :)
=20quoted
quoted
+ /* Load some guest volatiles. */ + PPC_LL r3, VCPU_LR(r4) + PPC_LL r5, VCPU_XER(r4) + PPC_LL r6, VCPU_CTR(r4) + PPC_LL r7, VCPU_CR(r4) + PPC_LL r8, VCPU_PC(r4) +#ifndef CONFIG_64BIT + lwz r9, (VCPU_SHARED_MSR + 4)(r11) +#else + ld r9, (VCPU_SHARED_MSR)(r11) +#endif + PPC_LL r0, VCPU_GPR(r0)(r4) + PPC_LL r1, VCPU_GPR(r1)(r4) + PPC_LL r2, VCPU_GPR(r2)(r4) + PPC_LL r10, VCPU_GPR(r10)(r4) + PPC_LL r11, VCPU_GPR(r11)(r4) + PPC_LL r12, VCPU_GPR(r12)(r4) + PPC_LL r13, VCPU_GPR(r13)(r4) + mtlr r3 + mtxer r5 + mtctr r6 + mtcr r7 + mtsrr0 r8 + mtsrr1 r9=20 Are you sure this should be shared->msr, not shadow_msr?=20 Yes, we don't use shadow_msr on bookehv. I'll add a comment in the struct definition as discussed in the other thread, as well as other areas where there are subtle differences between PR-mode and GS-mode.
Thanks! Alex