[PATCH v3 00/41] Optimize KVM/ARM for VHE systems
From: Christoffer Dall <hidden>
Date: 2018-01-18 13:32:24
Also in:
kvm, kvmarm
On Thu, Jan 18, 2018 at 03:18:21PM +0300, Yury Norov wrote:
On Thu, Jan 18, 2018 at 12:16:32PM +0100, Christoffer Dall wrote:
Hi Yury,

[cc'ing Alex Bennee who had some thoughts on this]

On Mon, Jan 15, 2018 at 05:14:23PM +0300, Yury Norov wrote:
On Fri, Jan 12, 2018 at 01:07:06PM +0100, Christoffer Dall wrote:
This series redesigns parts of KVM/ARM to optimize the performance on VHE systems. The general approach is to try to do as little work as possible when transitioning between the VM and the hypervisor. This has the benefit of lower latency when waiting for interrupts and delivering virtual interrupts, and reduces the overhead of emulating behavior and I/O in the host kernel.

Patches 01 through 06 are not VHE specific, but rework parts of KVM/ARM that can be generally improved. We then add infrastructure to move more logic into vcpu_load and vcpu_put, we improve handling of VFP and debug registers.

We then introduce a new world-switch function for VHE systems, which we can tweak and optimize for VHE systems. To do that, we rework a lot of the system register save/restore handling and emulation code that may need access to system registers, so that we can defer as many system register save/restore operations to vcpu_load and vcpu_put, and move this logic out of the VHE world switch function.

We then optimize the configuration of traps. On non-VHE systems, both the host and VM kernels run in EL1, but because the host kernel should have full access to the underlying hardware, but the VM kernel should not, we essentially make the host kernel more privileged than the VM kernel despite them both running at the same privilege level by enabling VE traps when entering the VM and disabling those traps when exiting the VM. On VHE systems, the host kernel runs in EL2 and has full access to the hardware (as much as allowed by secure side software), and is unaffected by the trap configuration. That means we can configure the traps for VMs running in EL1 once, and don't have to switch them on and off for every entry/exit to/from the VM.
Finally, we improve our VGIC handling by moving all save/restore logic out of the VHE world-switch, and we make it possible to truly only evaluate if the AP list is empty and not do *any* VGIC work if that is the case, and only do the minimal amount of work required in the course of the VGIC processing when we have virtual interrupts in flight.

The patches are based on v4.15-rc3, v9 of the level-triggered mapped interrupts support series [1], and the first five patches of James' SDEI series [2].

I've given the patches a fair amount of testing on Thunder-X, Mustang, Seattle, and TC2 (32-bit) for non-VHE testing, and tested VHE functionality on the Foundation model, running both 64-bit VMs and 32-bit VMs side-by-side and using both GICv3-on-GICv3 and GICv2-on-GICv3.

The patches are also available in the vhe-optimize-v3 branch on my kernel.org repository [3]. The vhe-optimize-v3-base branch contains prerequisites of this series.

Changes since v2:
 - Rebased on v4.15-rc3.
 - Includes two additional patches that only do vcpu_load after kvm_vcpu_first_run_init and only for KVM_RUN.
 - Addressed review comments from v2 (detailed changelogs are in the individual patches).

Thanks,
-Christoffer

[1]: git://git.kernel.org/pub/scm/linux/kernel/git/cdall/linux.git level-mapped-v9
[2]: git://linux-arm.org/linux-jm.git sdei/v5/base
[3]: git://git.kernel.org/pub/scm/linux/kernel/git/cdall/linux.git vhe-optimize-v3

I tested this v3 series on ThunderX2 with the IPI benchmark: https://lkml.org/lkml/2017/12/11/364

I tried to address your comments from the discussion of v2, such as pinning the module to a specific CPU (with taskset), increasing the number of iterations, and tuning the governor to max performance. The results didn't change much, and are pretty stable.

Compared to the vanilla guest, Normal IPI delivery for v3 is 20% slower. For v2 it was 27% slower, and for v1, 42% faster.
What's interesting is that the acknowledge time is much faster for v3, so the overall time to deliver and acknowledge an IPI (2nd column) is less than with the vanilla 4.15-rc3 kernel.

The test setup has not changed since v2: ThunderX2, 112 online CPUs, guest running under qemu-kvm, emulating GIC version 3.

Below are the test results for v1-3, normalized to the host vanilla kernel dry-run time.

Yury

Host, v4.14:
Dry-run:            0        1
Self-IPI:           9       18
Normal IPI:        81      110
Broadcast IPI:      0     2106

Guest, v4.14:
Dry-run:            0        1
Self-IPI:          10       18
Normal IPI:       305      525
Broadcast IPI:      0     9729

Guest, v4.14 + VHE:
Dry-run:            0        1
Self-IPI:           9       18
Normal IPI:       176      343
Broadcast IPI:      0     9885

And for v2.

Host, v4.15:
Dry-run:            0        1
Self-IPI:           9       18
Normal IPI:        79      108
Broadcast IPI:      0     2102

Guest, v4.15-rc:
Dry-run:            0        1
Self-IPI:           9       18
Normal IPI:       291      526
Broadcast IPI:      0    10439

Guest, v4.15-rc + VHE:
Dry-run:            0        2
Self-IPI:          14       28
Normal IPI:       370      569
Broadcast IPI:      0    11688

And for v3.

Host 4.15-rc3
Dry-run:            0        1
Self-IPI:           9       18
Normal IPI:        80      110
Broadcast IPI:      0     2088

Guest, 4.15-rc3
Dry-run:            0        1
Self-IPI:           9       18
Normal IPI:       289      497
Broadcast IPI:      0     9999

Guest, 4.15-rc3 + VHE
Dry-run:            0        2
Self-IPI:          12       24
Normal IPI:       347      490
Broadcast IPI:      0    11906

So, I had a look at your measurement code, and just want to make a sanity check that I understand the measurements correctly.

Firstly, if we execute something 100,000 times and summarize the result for each run, and get anything less than 100,000 (in this case ~300), without scaling the value, doesn't that mean that in the vast majority of cases, you are getting 0 as your measurement?

I cannot report absolute numbers, so I posted values normalized to the dry-run case. 300 for IPI delivery means that it is 300 times slower than a no-op (the dry-run case). The absolute numbers look quite reasonable: a few microseconds for a normal IPI.
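As a quick cross-check of the percentages quoted above, here is a minimal sketch that recomputes them from the first-column "Normal IPI" values in the tables. The helper names `normalize` and `relative_change` are made up for illustration and are not part of the benchmark patch; the raw (absolute) timings are not given in this thread, so only the already-normalized values are used.

```python
# First-column "Normal IPI" values copied from the tables in this thread.
# They are already normalized to the host dry-run time.
NORMAL_IPI = {
    "v1": {"guest": 305, "guest_vhe": 176},
    "v2": {"guest": 291, "guest_vhe": 370},
    "v3": {"guest": 289, "guest_vhe": 347},
}


def normalize(raw, dry_run_raw):
    """Express a raw timing as a multiple of the dry-run (no-op) cost.

    Hypothetical helper: the raw timings are not reported in the thread.
    """
    return raw / dry_run_raw


def relative_change(patched, baseline):
    """Percent change of patched vs. baseline; positive means slower."""
    return (patched - baseline) / baseline * 100.0


for version, values in NORMAL_IPI.items():
    change = relative_change(values["guest_vhe"], values["guest"])
    direction = "slower" if change > 0 else "faster"
    print(f"{version}: {abs(change):.0f}% {direction}")
```

With these inputs the script reports 42% faster for v1, 27% slower for v2, and 20% slower for v3, which matches the figures stated with the results.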
Ah, I see, you normalized it after the output from your benchmark. I thought you normalized it in the benchmark code originally, but then I didn't see it in the patch you linked to, so wasn't sure what was going on.
Let me know if you need absolute numbers. https://lkml.org/lkml/2017/12/13/301
I trust you, that's fine.
Secondly, are we sure all the required memory barriers are in place? I know that the IPI send contains an smp_wmb(), but when you read back the value in the caller, do you have the necessary smp_wmb() on the handler side and a corresponding smp_rmb() on the sending side? I'm not sure what kind of effect missing barriers for a measurement framework like this would have, but it's worth making sure we're not chasing red herrings here.

I don't share memory between PMUs.
PMUs? You do share memory between your CPUs, it's the little piece of memory that your time variable points to. I was concerned if the read back from your sender CPU of the value written by the receiving CPU was properly ordered, but looking at handle_IPI and smp_call_function_single, there are barriers pretty much all over, and I don't think a missing barrier would result in what we see here (given that I understand the normalization above).
That obviously doesn't change that the overall turnaround time is improved more in the v1 case than in the v3 case, which I'd like to explore/bisect in any case.

Me too. If you have any ideas, let me know and I'll check them.
So another thing that would be very useful (which I would do myself if I had access to a TX2) would be to simply bisect the series and run the benchmark and see where the regression is introduced.

In case you have time for that, I have a bisectable series with the recent KVM/ARM fixes in the 'vhe-optimize-v3-with-fixes' branch on:

git://git.kernel.org/pub/scm/linux/kernel/git/cdall/linux.git

Thanks,
-Christoffer
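To illustrate what bisecting the series amounts to, here is a minimal sketch of the search itself. It assumes the base is good, the full 41-patch series is regressed, and the regression persists once introduced. The function `first_bad_patch` and the `is_regressed` callback are hypothetical; in practice `git bisect` drives this process and the callback would be a build, boot, and benchmark run.

```python
def first_bad_patch(num_patches, is_regressed):
    """Return the 1-based index of the first patch that shows the regression.

    is_regressed(n) is a hypothetical callback that reports whether the tree
    with the first n patches applied is regressed. Assumes n = 0 is good and
    n = num_patches is bad.
    """
    good, bad = 0, num_patches
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_regressed(mid):
            bad = mid   # regression already present at mid
        else:
            good = mid  # regression introduced after mid
    return bad


# Usage with an arbitrary stand-in: pretend patch 17 introduces the regression.
# The number 17 is purely illustrative and not a result from this thread.
print(first_bad_patch(41, lambda n: n >= 17))
```

For a 41-patch series this needs at most six benchmark runs rather than one per patch.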