Re: [PATCH v4 0/9] Inefficient OPPs
From: "Rafael J. Wysocki" <rafael@kernel.org>
Date: 2021-08-16 14:20:11
On Tue, Aug 10, 2021 at 8:13 AM Viresh Kumar [off-list ref] wrote:
On 04-08-21, 18:21, Rafael J. Wysocki wrote:quoted
The cpufreq changes are mostly fine by me now, but I would like to hear from Viresh regarding this.I have few doubts/concerns here. Just to iterate it again, the idea here is to choose a higher frequency, which will work in an efficient manner based on energy consumption. So this _only_ works for the case where the caller asked for CPUFREQ_RELATION_L. - The new flag being added here, CPUFREQ_RELATION_E, doesn't feel complete in this sense to me. It should rather be called as CPUFREQ_RELATION_LE instead as it is _always_ used with relation L.
Well, this mostly is a matter of what the flag means. If "E" implies "L", I don't see a problem.
- IMO this has made the caller site a bit confusing now, i.e. why we send CPUFREQ_RELATION_E sometimes and CPUFREQ_RELATION_H at other times. Why shouldn't the _H freq be efficient as well ?
This is a good point, though.
I understand that this was done to not do the efficient stuff in case of
userspace/powersave/performance governors.
What about reusing following for finding all such cases ?
policy->governor->flags & CPUFREQ_GOV_DYNAMIC_SWITCHING
The driver can set a flag to tell if it wants efficient frequencies
or not, and at runtime we apply efficient algorithm only if the
current governor does DVFS, which will leave out
userspace/performance/powersave.As long as this can be done without actually accessing policy->governor->flags on every transition, it sounds like a good idea.
Now the other thing I didn't like since the beginning, I still don't like it :) A cpufreq table is provided by the driver, which can have some inefficient frequencies. The inefficient frequencies can be caught by many parts of the kernel, the current way (in this series) is via EM. But this can totally be anything else as well, like a devfreq driver. The way we have tied this whole thing with EM, via cpufreq_read_inefficiencies_from_em(), is what I don't like. We have closely bound the whole thing with one of the ways this can be done and we shouldn't be polluting the cpufreq core with this IMHO. In a sane world, the cpufreq core should just provide an API, like cpufreq_set_freq_invariant() and it can be called by any part of the kernel.
Right.
I understand that the current problem is where do we make that call from and I suggested dev_pm_opp_of_register_em() for the same earlier. But that doesn't work as the policy isn't properly initialized by that point. Now that I see the current implementation, I think we can make it work in a different way. - Copy what's done for thermal-cooling in cpufreq core, i.e. CPUFREQ_IS_COOLING_DEV flag, for the EM core as well. So the cpufreq drivers can set a flag, CPUFREQ_REGISTER_EM, and the cpufreq core can call dev_pm_opp_of_register_em() on their behalf. This call will be made from cpufreq_online(), just before/after cpufreq_thermal_control_enabled() stuff. By this point the policy is properly initialized and is also updated in per_cpu(cpufreq_cpu_data). - Now the cpufreq core can provide an API like cpufreq_set_freq_invariant(int cpu, unsigned long freq), which can be called from EM core's implementation of em_dev_register_perf_domain().
Sounds reasonable to me.