Re: [PATCH v4 0/9] Inefficient OPPs

From: Lukasz Luba <lukasz.luba@arm.com>
Date: 2021-08-10 12:28:27


On 8/10/21 7:13 AM, Viresh Kumar wrote:

On 04-08-21, 18:21, Rafael J. Wysocki wrote:

quoted

The cpufreq changes are mostly fine by me now, but I would like to
hear from Viresh regarding this.

I have few doubts/concerns here.

Just to iterate it again, the idea here is to choose a higher
frequency, which will work in an efficient manner based on energy
consumption. So this _only_ works for the case where the caller asked
for CPUFREQ_RELATION_L.

- The new flag being added here, CPUFREQ_RELATION_E, doesn't feel
   complete in this sense to me. It should rather be called as
   CPUFREQ_RELATION_LE instead as it is _always_ used with relation L.

- IMO this has made the caller site a bit confusing now, i.e.  why we
   send CPUFREQ_RELATION_E sometimes and CPUFREQ_RELATION_H at other
   times. Why shouldn't the _H freq be efficient as well ? I understand
   that this was done to not do the efficient stuff in case of
   userspace/powersave/performance governors.

   What about reusing following for finding all such cases ?

         policy->governor->flags & CPUFREQ_GOV_DYNAMIC_SWITCHING

   The driver can set a flag to tell if it wants efficient frequencies
   or not, and at runtime we apply efficient algorithm only if the
   current governor does DVFS, which will leave out
   userspace/performance/powersave.


Now the other thing I didn't like since the beginning, I still don't
like it :)

A cpufreq table is provided by the driver, which can have some
inefficient frequencies. The inefficient frequencies can be caught by
many parts of the kernel, the current way (in this series) is via EM.
But this can totally be anything else as well, like a devfreq driver.

Currently devfreq drivers and governors don't support 'inefficient'
OPPs idea. They are not even 'utilization' driven yet. I'm not sure
if that would make sense for their workloads. For now, they are far
behind the CPUFreq/scheduler development in this field.
It needs more research and experiments, to even estimate if this brings
benefits. So, I would just skip devfreq use case...

The way we have tied this whole thing with EM, via
cpufreq_read_inefficiencies_from_em(), is what I don't like. We have
closely bound the whole thing with one of the ways this can be done
and we shouldn't be polluting the cpufreq core with this IMHO. In a
sane world, the cpufreq core should just provide an API, like
cpufreq_set_freq_invariant() and it can be called by any part of
the kernel.

I understand that the current problem is where do we make that call
from and I suggested dev_pm_opp_of_register_em() for the same earlier.
But that doesn't work as the policy isn't properly initialized by that
point.

True, the policy is not initialized yet when cpufreq_driver::init()
callback is called, which the current place for scmi-cpufreq.

What about the other callback cpufreq_driver::ready() ?
This might solve the two issues I mentioned below.

Now that I see the current implementation, I think we can make it work
in a different way.

- Copy what's done for thermal-cooling in cpufreq core, i.e.
   CPUFREQ_IS_COOLING_DEV flag, for the EM core as well. So the cpufreq
   drivers can set a flag, CPUFREQ_REGISTER_EM, and the cpufreq core
   can call dev_pm_opp_of_register_em() on their behalf. This call will
   be made from cpufreq_online(), just before/after
   cpufreq_thermal_control_enabled() stuff. By this point the policy is
   properly initialized and is also updated in
   per_cpu(cpufreq_cpu_data).

- Now the cpufreq core can provide an API like
   cpufreq_set_freq_invariant(int cpu, unsigned long freq), which can
   be called from EM core's implementation of
   em_dev_register_perf_domain().

I have to point out that there are two issues (which we
might solve):
1. Advanced setup, due to per-CPU performance requests,
which are not limited to real DVFS domain.
The scmi-cpufreq (and possibly some other soon) uses more
tricky EM setup. As you might recall, the construction of temporary
cpumask is a bit complex, since we want per-CPU policy, but the
cpumask pass to EM has not a single bit but is 'spanned' with a few
CPUs.

2. The scmi-cpufreq (and IIRC MTK cpufreq driver soon) requires
custom struct em_data_callback, since the power data is coming from FW.

If there is a need for complex EM registration, maybe we could
do it in the cpufreq_driver::ready(). The policy would be ready
during that call, so it might fly.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help