Re: [RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU preemption
From: Shrikanth Hegde <hidden>
Date: 2025-10-30 17:43:43
Also in:
lkml
Hi Sean.

On 10/23/25 12:16 AM, Sean Christopherson wrote:
> On Tue, Oct 21, 2025, Shrikanth Hegde wrote:
>> Hi Sean. Thanks for taking time and going through the series.
>>
>> On 10/20/25 8:02 PM, Sean Christopherson wrote:
>>> On Wed, Sep 10, 2025, Shrikanth Hegde wrote:
>>>> tl;dr
>>>>
>>>> This is follow up of [1] with few fixes and addressing review comments.
>>>> Upgraded it to RFC PATCH from RFC.
>>>> Please review.
>>>>
>>>> [1]: v2 - https://lore.kernel.org/all/20250625191108.1646208-1-sshegde@linux.ibm.com/
>>>>
>>>> v2 -> v3:
>>>> - Renamed to paravirt CPUs
>>>
>>> There are myriad uses of "paravirt" throughout Linux and related
>>> environments, and none of them mean "oversubscribed" or "contended".
>>> I assume Hillf's comments triggered the rename from "avoid CPUs", but
>>> IMO "avoid" is at least somewhat accurate; "paravirt" is wildly
>>> misleading.
>>
>> Name has been tricky. We want to have a positive sounding name while
>> conveying that these CPUs are not be used for now due to contention,
>> they may be used again when the contention has gone.
>
> I suspect part of the problem with naming is the all-or-nothing
> approach itself. There's a _lot_ of policy baked into that seemingly
> simple decision, and thus it's hard to describe with a human-friendly
> name.

open for suggestions :)

>>>> Open issues:
>>>> - Derivation of hint from steal time is still a challenge. Some work
>>>>   is underway to address it.
>>>> - Consider kvm and other hypervsiors and how they could derive the
>>>>   hint. Need inputs from community.
>>>
>>> Bluntly, this series is never going to land, at least not in a form
>>> that's remotely close to what is proposed here. This is an incredibly
>>> simplistic way of handling overcommit, and AFAICT there's no line of
>>> sight to supporting more complex scenarios.
>>
>> Could you describe these complex scenarios?
>
> Any setup where "don't use this CPU" isn't a viable option, e.g.
> because all cores could be overcommitted at any given time, or is far,
> far too coarse-grained. Very few use cases can distill vCPU scheduling
> needs and policies into single flag.

Okay. Let me explain what the current thought process is. S390 and
pseries are the current main use cases.

On S390, the Z hypervisor provides a distinction among vCPUs. vCPUs are
marked as Vertical High, Vertical Medium and Vertical Low. When there is
steal time, it is recommended to use the Vertical Highs and avoid using
the Vertical Lows. In such cases, using this infra, one can avoid
scheduling anything on these Vertical Low vCPUs. A performance benefit
is observed since there is less contention and CPU cycles come mainly
from the Vertical Highs.

On the PowerVM hypervisor, the hypervisor dispatches a full core at a
time. All SMT=8 siblings are always dispatched to the same core. That
means it is beneficial to schedule on vCPU siblings together at core
level. When there is contention for a pCPU, the full core is preempted,
i.e. all vCPUs belonging to that core are preempted. In such cases,
depending on the overcommit configuration and on the steal time, one
could limit the number of cores in use by using a limited set of vCPUs.
When done that way, we see better latency numbers and an increase in
throughput compared to out of the box. The cover letter has those
numbers.

Now, let's come to KVM with Linux running as the hypervisor. Correct me
if I am wrong. Each vCPU in KVM is a process in the host. When a vCPU is
running, that process is in the running state as well. When there is
overcommit and all vCPUs are running, there are more processes than
physical CPUs, so the host has to context switch and will preempt one
vCPU to run another. It can also preempt a vCPU to run some host
process.

If we restrict the number of vCPUs on which the workload is currently
running, then the number of runnable processes in the host also reduces,
and there is less chance of host context switches. Since this avoids the
overhead of KVM context save/restore, the workload is likely to benefit.

I guess it is possible to distinguish between a host process and a vCPU
running as a process. If so, the host can decide how many threads it can
optimally run and signal each guest depending on the configuration.

Currently keeping it arch dependent, since IMHO each hypervisor is in
the right place to make that decision. Not sure a one-size-fits-all
approach works here.

Another tricky point is how this signal is going to be delivered. It
could be an hcall, the VPA area, some shared memory region, or a BPF
method similar to the vCPU boosting patch series. There too, I think it
is best to leave it to the arch to specify how, the reason being that
the BPF method will not work for PowerVM hypervisors.
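
To make the core-level idea concrete, here is a minimal userspace model
in Python. It is not code from the series. The function names, the
contiguous vCPU numbering (core c owns vCPUs c*smt .. c*smt+smt-1) and,
above all, the proportional steal-time rule are assumptions made only
for illustration; deriving the hint from steal time is still an open
issue.

```python
def usable_cores(total_cores: int, steal_fraction: float, min_cores: int = 1) -> int:
    """Placeholder policy (assumption): shrink the number of cores in use
    in proportion to the observed steal time, never going below min_cores."""
    if not 0.0 <= steal_fraction <= 1.0:
        raise ValueError("steal_fraction must be within [0, 1]")
    wanted = int(total_cores * (1.0 - steal_fraction))
    return max(min_cores, min(total_cores, wanted))


def vcpus_to_avoid(total_cores: int, cores_in_use: int, smt: int = 8) -> set[int]:
    """Return the vCPU ids to leave idle, always as whole cores.

    Assumes vCPU ids are contiguous per core, so avoiding the last
    (total_cores - cores_in_use) cores avoids all of their SMT siblings.
    """
    first_avoided = cores_in_use * smt
    return set(range(first_avoided, total_cores * smt))
```

With 4 cores at SMT=8 and 25% steal, this model keeps 3 cores in use and
marks vCPUs 24..31, i.e. one full core, as not to be used for now.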

> E.g. if all CPUs in a system are being used to vCPU tasks, all vCPUs
> are actively running, and the host has a non-vCPU task that _must_
> run, then the host will need to preempt a vCPU task. Ideally, a
> paravirtualized scheduling system would allow

The host/hypervisor need not mark the vCPU as "not to be used" every
single time it preempts. It needs to do so only when there are more vCPU
processes than physical CPUs and preemption is happening between vCPU
processes. There would be corner cases, such as only one physical CPU
with two KVM guests each having one vCPU, where nothing much can be
done.
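
A rough Python sketch of that condition follows. It is an illustration
only, not something the series implements. The function name, the
per-guest dictionary and the "take one vCPU from the largest guest
first" rule are all hypothetical, and host (non-vCPU) tasks are
deliberately ignored.

```python
def vcpus_to_mark(runnable_vcpus: dict[str, int], physical_cpus: int) -> dict[str, int]:
    """Return, per guest, how many vCPUs the host would hint as not to be used.

    A hint is produced only while runnable vCPU tasks outnumber physical
    CPUs. Hypothetical policy: repeatedly take one vCPU from the guest that
    still has the most, and never take a guest's last vCPU.
    """
    remaining = dict(runnable_vcpus)
    marked = {guest: 0 for guest in runnable_vcpus}
    excess = sum(remaining.values()) - physical_cpus
    while excess > 0:
        guest = max(remaining, key=remaining.get)
        if remaining[guest] <= 1:
            break  # every guest is down to one vCPU; nothing more can be done
        remaining[guest] -= 1
        marked[guest] += 1
        excess -= 1
    return marked
```

For two guests with 4 vCPUs each on 8 physical CPUs nothing is marked;
on 6 physical CPUs each guest gets one vCPU marked. In the corner case
of one physical CPU and two single-vCPU guests the result is all zeros.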

> the host to make an informed decision when choosing which vCPU to
> preempt, e.g. to minimize disruption to the guest(s).