Thread (31 messages) 31 messages, 5 authors, 2021-08-23

Re: [PATCH v4 0/9] Inefficient OPPs

From: "Rafael J. Wysocki" <rafael@kernel.org>
Date: 2021-08-16 14:20:11

On Tue, Aug 10, 2021 at 8:13 AM Viresh Kumar [off-list ref] wrote:
On 04-08-21, 18:21, Rafael J. Wysocki wrote:
quoted
The cpufreq changes are mostly fine by me now, but I would like to
hear from Viresh regarding this.
I have few doubts/concerns here.

Just to iterate it again, the idea here is to choose a higher
frequency, which will work in an efficient manner based on energy
consumption. So this _only_ works for the case where the caller asked
for CPUFREQ_RELATION_L.

- The new flag being added here, CPUFREQ_RELATION_E, doesn't feel
  complete in this sense to me. It should rather be called as
  CPUFREQ_RELATION_LE instead as it is _always_ used with relation L.
Well, this mostly is a matter of what the flag means.  If "E" implies
"L", I don't see a problem.
- IMO this has made the caller site a bit confusing now, i.e.  why we
  send CPUFREQ_RELATION_E sometimes and CPUFREQ_RELATION_H at other
  times. Why shouldn't the _H freq be efficient as well ?
This is a good point, though.
 I understand  that this was done to not do the efficient stuff in case of
  userspace/powersave/performance governors.

  What about reusing following for finding all such cases ?

        policy->governor->flags & CPUFREQ_GOV_DYNAMIC_SWITCHING

  The driver can set a flag to tell if it wants efficient frequencies
  or not, and at runtime we apply efficient algorithm only if the
  current governor does DVFS, which will leave out
  userspace/performance/powersave.
As long as this can be done without actually accessing
policy->governor->flags on every transition, it sounds like a good
idea.
Now the other thing I didn't like since the beginning, I still don't
like it :)

A cpufreq table is provided by the driver, which can have some
inefficient frequencies. The inefficient frequencies can be caught by
many parts of the kernel, the current way (in this series) is via EM.
But this can totally be anything else as well, like a devfreq driver.

The way we have tied this whole thing with EM, via
cpufreq_read_inefficiencies_from_em(), is what I don't like. We have
closely bound the whole thing with one of the ways this can be done
and we shouldn't be polluting the cpufreq core with this IMHO. In a
sane world, the cpufreq core should just provide an API, like
cpufreq_set_freq_invariant() and it can be called by any part of
the kernel.
Right.
I understand that the current problem is where do we make that call
from and I suggested dev_pm_opp_of_register_em() for the same earlier.
But that doesn't work as the policy isn't properly initialized by that
point.

Now that I see the current implementation, I think we can make it work
in a different way.

- Copy what's done for thermal-cooling in cpufreq core, i.e.
  CPUFREQ_IS_COOLING_DEV flag, for the EM core as well. So the cpufreq
  drivers can set a flag, CPUFREQ_REGISTER_EM, and the cpufreq core
  can call dev_pm_opp_of_register_em() on their behalf. This call will
  be made from cpufreq_online(), just before/after
  cpufreq_thermal_control_enabled() stuff. By this point the policy is
  properly initialized and is also updated in
  per_cpu(cpufreq_cpu_data).

- Now the cpufreq core can provide an API like
  cpufreq_set_freq_invariant(int cpu, unsigned long freq), which can
  be called from EM core's implementation of
  em_dev_register_perf_domain().
Sounds reasonable to me.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help