[PATCH 0/1] KVM: powerpc/book3s_hv: Handle deferred CFS bandwidth throttle on guest re-entry
From: Vishal Chourasia <hidden>
Date: 2026-06-26 10:55:55
Also in: kvm, lkml
This series fixes a KVM scheduling bug on Book3S HV (POWER8/POWER9/POWER10)
where a guest VM under a cpu.max bandwidth limit can run arbitrarily past its
quota and then appear completely frozen for minutes afterwards.
== Background ==
Commit 2cd571245b43 ("sched/fair: Add related data structure for task based
throttle"), merged in v6.18, changed how CFS bandwidth throttling enforces
its limit. Previously, throttle_cfs_rq() dequeued tasks directly. Under the
new scheme it queues a task_work item via task_work_add(..., TWA_RESUME),
sets TIF_NOTIFY_RESUME, and relies on that work running on the kernel return
path to actually dequeue the task.
For KVM guests this means the work must be drained before each guest entry,
not just on the normal syscall return path. Commit 935ace2fb5cc ("entry:
Provide infrastructure for work before transitioning to guest mode")
introduced kvm_xfer_to_guest_mode_handle_work() for exactly this purpose.
x86 (commit 72c3c0fe54a3), arm64 (commit 6caa5812e2d1), riscv, s390, and
loongarch all adopted it. Book3S HV did not. [1]
== Root Cause ==
Book3S HV's vCPU run loops — kvmhv_run_single_vcpu() for POWER9+ and
kvmppc_run_vcpu() for pre-POWER9 — only test TIF_SIGPENDING and
TIF_NEED_RESCHED before re-entering the guest. TIF_NOTIFY_RESUME is never
checked, and the deferred throttle task_work therefore never runs while a
vCPU is inside the run loop.
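The gap can be illustrated with a small user-space model. This is only a
sketch of the flag test described above, not the kernel code: the MODEL_*
bit values and both function names are hypothetical, and the real TIF_*
bits are architecture-specific.

```c
#include <stdbool.h>

/* Hypothetical flag bits for this model; real TIF_* values are per-arch. */
enum {
	MODEL_SIGPENDING    = 1u << 0,
	MODEL_NEED_RESCHED  = 1u << 1,
	MODEL_NOTIFY_RESUME = 1u << 2,
};

/* Pre-patch behaviour as described above: only two flags are examined. */
bool old_loop_leaves_guest(unsigned int flags)
{
	return (flags & (MODEL_SIGPENDING | MODEL_NEED_RESCHED)) != 0;
}

/* Mask widened so that pending deferred task_work also stops re-entry. */
bool new_loop_leaves_guest(unsigned int flags)
{
	return (flags & (MODEL_SIGPENDING | MODEL_NEED_RESCHED |
			 MODEL_NOTIFY_RESUME)) != 0;
}
```

With only MODEL_NOTIFY_RESUME set, the first predicate returns false (the
vCPU re-enters the guest) while the second returns true, which is the
point at which the deferred throttle work would get a chance to run.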
For a CPU-bound guest that generates few KVM exits back to QEMU user space
(e.g. a compute-heavy or busy-looping workload), the vCPU thread never
returns to user mode. throttle_cfs_rq() sets cfs_rq->throttled = 1 and
queues the task_work, but the guest continues to run unchecked.
cfs_rq->runtime_remaining goes increasingly negative with every scheduling
period while the throttle flag sits ignored.
The only mechanism that repays the debt is the periodic bandwidth timer:
30 ms of quota is added per 100 ms period. Once runtime_remaining has
drifted hundreds of seconds negative, recovering to zero at 300 ms/s takes
minutes. The deferred throttle takes effect only when the vCPU thread
finally exits to user space; from then on the cgroup is legitimately
throttled and the VM is completely frozen until the debt is repaid.
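The recovery time follows directly from the quota and period. The helper
below is a back-of-the-envelope sketch, not kernel code: recovery_seconds()
is a hypothetical name, and it assumes the throttled task consumes no
runtime while the debt is being repaid.

```c
#include <assert.h>
#include <stdint.h>

/* Seconds needed for periodic replenishment alone to repay debt_ns. */
double recovery_seconds(int64_t debt_ns, int64_t quota_ns, int64_t period_ns)
{
	assert(debt_ns >= 0 && quota_ns > 0 && period_ns > 0);

	/* Each period repays quota_ns, so debt/quota periods are needed. */
	double periods = (double)debt_ns / (double)quota_ns;

	return periods * ((double)period_ns / 1e9);
}
```

For a debt of 213902833931 ns with a 30 ms quota and a 100 ms period this
gives roughly 713 seconds, i.e. about twelve minutes of frozen guest.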
== Debugging ==
The vCPU thread was placed in a cgroup with the following CPU bandwidth
limit:
quota = 30ms
period = 100ms
The bug was diagnosed using a bpftrace script probing throttle_cfs_rq()
and unthrottle_cfs_rq() and sampling cfs_rq->runtime_remaining every
second. The trace shows the debt accumulation phase, the slow recovery
phase, and the immediate re-throttle on resumption:
Debt accumulation (vCPU in guest, no exits):
+1471 s runtime_remaining=-209702865115 ns throttled=1
+1472 s runtime_remaining=-210402866357 ns throttled=1
... # ~-700 ms/s (growing debt)
+1477 s runtime_remaining=-213902833931 ns throttled=1
Recovery (vCPU exits to QEMU user space; bandwidth timer replenishes):
+1478 s runtime_remaining=-213617443453 ns throttled=1
+1479 s runtime_remaining=-213317443453 ns throttled=1
... # ~+300 ms/s (30ms quota/100ms)
After ~710 seconds of recovery, debt reaches zero:
──── unthrottle_cfs_rq @ cpu=768 +2190.029568131 s ────
runtime_remaining = 1 ns # just crossed zero
The vCPU immediately re-enters the guest and over-runs its quota again:
──── throttle_cfs_rq @ cpu=768 +2190.055327252 s ────
runtime_remaining = -5667293 ns # ~5.7 ms of debt, ~26 ms after unthrottle
The cycle then repeats identically from a fresh -700 ms/s accumulation.
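The two slopes annotated in the trace (~-700 ms/s and ~+300 ms/s) can be
reproduced with a simple model. This is a sketch under stated assumptions:
net_drift_ns_per_sec() is a hypothetical helper, it models a single vCPU
thread that either burns one full CPU or none at all, and it ignores
slice granularity and timer jitter.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC INT64_C(1000000000)

/* Net change of runtime_remaining per wall-clock second in this model. */
int64_t net_drift_ns_per_sec(int64_t quota_ns, int64_t period_ns, int running)
{
	assert(quota_ns > 0 && period_ns > 0);

	/* Quota handed out by the bandwidth timer per second. */
	int64_t refill = quota_ns * NSEC_PER_SEC / period_ns;

	/* Runtime charged per second while the vCPU stays in the guest. */
	int64_t burn = running ? NSEC_PER_SEC : 0;

	return refill - burn;
}
```

With quota = 30 ms and period = 100 ms the model yields -700000000 ns/s
while the vCPU runs and +300000000 ns/s once it is off the CPU, matching
the per-second deltas between consecutive samples above.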
cpu.stat confirms the pathology — 100% throttle rate and virtually all
CPU time accumulated in kernel (KVM) mode:
nr_periods = 117457
nr_throttled = 117457 # every single period
system_usec = 4334782636 # >99.99% kernel time (QEMU in KVM_RUN)
strace of the QEMU vCPU thread confirms long stretches where
ioctl(KVM_RUN) does not return — the vCPU is running in guest mode
with no VM-exits reaching user space.
== Fix Summary ==
Opt Book3S HV into VIRT_XFER_TO_GUEST_WORK and drain pending guest-mode
work (including the deferred CFS throttle task_work) on every guest
re-entry in both run loops. The new checks are a superset of the existing
need_resched() checks and do not alter the signal or exit accounting.
[1] https://lore.kernel.org/all/20250421102837.78515-2-sshegde@linux.ibm.com/
Vishal Chourasia (1):
  KVM: powerpc/book3s_hv: Use generic xfer to guest work function

 arch/powerpc/kvm/Kconfig     |  1 +
 arch/powerpc/kvm/book3s_hv.c | 58 +++++++++++++++++++++++++++++++-----
 2 files changed, 52 insertions(+), 7 deletions(-)
--
2.54.0