Thread: 30 messages, 5 authors, 2020-08-06

Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled

From: Francisco Jerez <hidden>
Date: 2020-07-21 23:14:49
Also in: linux-pm, lkml

Srinivas Pandruvada [off-list ref] writes:
On Mon, 2020-07-20 at 16:20 -0700, Francisco Jerez wrote:
"Rafael J. Wysocki" [off-list ref] writes:
On Fri, Jul 17, 2020 at 2:21 AM Francisco Jerez <
currojerez@riseup.net> wrote:
"Rafael J. Wysocki" [off-list ref] writes:
[...]
Overall, so far, I'm seeing a claim that the CPU subsystem can be
made to use less energy and do as much work as before (which is what
improving the energy-efficiency means in general) if the maximum
frequency of CPUs is limited in a clever way.

I'm failing to see what that clever way is, though.
Hopefully the clarifications above help some.
To simplify:

Suppose I call numpy.multiply() to multiply two big arrays and the
thread is pegged to a CPU. Let's say the CPU finishes the job in 10 ms
using a P-state of 0x20, but the same job could have been done in
10 ms even at a P-state of 0x16. Then we are not energy efficient. To
really know where the bottleneck is, there are a number of perf
counters to look at; maybe the cache was the issue and we could
rather raise the uncore frequency a little. The APERF and MPERF
counters alone are not enough.
Yes, that's right, APERF and MPERF aren't sufficient to identify every
kind of possible bottleneck; some visibility into the utilization of
other subsystems is necessary in addition -- e.g. the instrumentation
introduced in my series to detect a GPU bottleneck.  A bottleneck
condition in an IO device can be communicated to CPUFREQ by adjusting a
PM QoS latency request (link [2] in my previous reply) that effectively
gives the governor permission to rearrange CPU work arbitrarily within
the specified time frame (which should be of the order of the natural
latency of the IO device -- e.g. at least the rendering time of a frame
for a GPU) in order to minimize energy usage.
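
To make the APERF/MPERF point above concrete, here is a minimal Python
sketch of what those two counters can tell a governor over one sampling
interval.  The helper names, the 2000 MHz base frequency and the sample
deltas are illustrative assumptions, not taken from intel_pstate or
from the series under discussion.  A thread pegged to a CPU reads as
100% busy at 3200 MHz whether it is compute-bound or stalled on cache
misses, so these readings alone cannot locate the bottleneck.

```python
def effective_frequency_mhz(aperf_delta, mperf_delta, base_mhz):
    """Average frequency while the CPU was in C0 during the interval.

    APERF advances at the actual core frequency and MPERF at a fixed
    reference frequency, both only while the CPU is in C0, so their
    ratio scales the reference frequency.
    """
    if mperf_delta <= 0:
        raise ValueError("MPERF must advance during the interval")
    return base_mhz * aperf_delta / mperf_delta


def c0_residency(mperf_delta, tsc_delta):
    """Fraction of the interval spent in C0 (the 'busy' fraction)."""
    if tsc_delta <= 0:
        raise ValueError("TSC must advance during the interval")
    return mperf_delta / tsc_delta


# Hypothetical 10 ms sample on a CPU with an assumed 2000 MHz base clock.
BASE_MHZ = 2000
tsc_delta = 20_000_000    # 10 ms at 2000 MHz
mperf_delta = 20_000_000  # never left C0: the thread is pegged
aperf_delta = 32_000_000  # ran at 1.6x the reference frequency

busy = c0_residency(mperf_delta, tsc_delta)
freq = effective_frequency_mhz(aperf_delta, mperf_delta, BASE_MHZ)
print(f"busy={busy:.0%} freq={freq:.0f} MHz")
```

Nothing in these two numbers says whether the cycles were spent
retiring instructions or waiting on memory or on another device, which
is why extra visibility into the other subsystems is needed.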
Or we characterize the workload at different P-states and set limits.
I think this is not what you want to say about energy efficiency with
your changes.

The way you are trying to improve "performance" is by having the caller
(device driver) say how important the job at hand is. Here, suppose the
device driver offloads these calculations to some GPU and can wait up to
10 ms; you want to tell the CPU to be slow. But if the P-state driver at
any moment observes that there is a chance of overshooting the latency,
it will immediately ask for a higher P-state. So you want P-state limits
based on the latency requirements of the caller. Since the caller has
more knowledge of the latency requirement, this allows other devices
sharing the power budget to get more or less power, and improves overall
energy efficiency as the combined performance of the system is improved.
Is this correct?
Yes, pretty much.
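
As a toy illustration of the summary above (not the heuristic in the
patch series), the following Python sketch picks the lowest P-state
ratio that still meets a caller-supplied latency budget.  The model is
an assumption made for the example: job time is the larger of the CPU
time and a frequency-independent wait on another device, and one ratio
step is taken to be 100 MHz.  All function names and numbers are
hypothetical.

```python
BUS_MHZ = 100  # assumed frequency step per P-state ratio unit


def job_time_us(cpu_cycles, ratio, other_us):
    """Toy model: the job ends when both the CPU work and the
    frequency-independent part (e.g. a GPU or memory wait) are done."""
    cpu_us = cpu_cycles / (ratio * BUS_MHZ)  # MHz == cycles per microsecond
    return max(cpu_us, other_us)


def lowest_ratio_within(cpu_cycles, other_us, budget_us, min_ratio, max_ratio):
    """Lowest ratio whose modelled job time fits the latency budget.

    Falls back to max_ratio when no ratio in range meets the budget.
    """
    for ratio in range(min_ratio, max_ratio + 1):
        if job_time_us(cpu_cycles, ratio, other_us) <= budget_us:
            return ratio
    return max_ratio


# The 10 ms job from the example: 22M CPU cycles plus a 10 ms wait
# on something other than the CPU.
ratio = lowest_ratio_within(22_000_000, 10_000, 10_000, 8, 0x20)
print(hex(ratio))
```

With these made-up numbers the job takes 10 ms at both 0x20 and 0x16
because the other device dominates, so the lower ratio loses no
performance.  A tighter budget from the caller pushes the result back
up towards the maximum ratio.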
