Re: [RFC][PATCH 7/7] cpufreq: intel_pstate: Change P-state selection algorithm for Core
From: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Date: 2016-08-22 18:53:50
Also in:
lkml
Hi Doug, I am not able to apply this patch. Can you send as a patch on top of Rafael's RFC 7/7. Since test takes long time, I want to apply correct patch. Thanks, Srinivas On Sat, 2016-08-13 at 08:59 -0700, Doug Smythies wrote:
quoted hunk ↗ jump to hunk
On 2016.08.05 17:02 Rafael J. Wysocki wrote:quoted
quoted
On 2016.08.03 21:19 Doug Smythies wrote:quoted
On 2016.07.31 16:49 Rafael J. Wysocki wrote: The PID-base P-state selection algorithm used by intel_pstate for Core processors is based on very weak foundations....[cut]...quoted
+static inline int32_t get_target_pstate_default(struct cpudata *cpu) +{ + struct sample *sample = &cpu->sample; + int32_t busy_frac; + int pstate; + + busy_frac = div_fp(sample->mperf, sample->tsc); + sample->busy_scaled = busy_frac * 100; + + if (busy_frac < cpu->iowait_boost) + busy_frac = cpu->iowait_boost; + + cpu->iowait_boost >>= 1; + + pstate = cpu->pstate.turbo_pstate; + return fp_toint((pstate + (pstate >> 2)) * busy_frac); +} +My previous replies (and see below) have suggested that some filtering is needed on the target pstate, otherwise, and dependant on the type of workload, it tends to oscillate. I added the IIR (Infinite Impulse Response) filter that I have suggested in the past:diff --git a/drivers/cpufreq/intel_pstate.cb/drivers/cpufreq/intel_pstate.c index c43ef55..262ec5f 100644--- a/drivers/cpufreq/intel_pstate.c +++ b/drivers/cpufreq/intel_pstate.c@@ -98,6 +98,7 @@ static inline u64 div_ext_fp(u64 x, u64 y)* @tsc: Difference of time stamp counter between last and * current sample * @time: Current time from scheduler + * @target: target pstate filtered. * * This structure is used in the cpudata structure to store performance sample * data for choosing next P State.@@ -108,6 +109,7 @@ struct sample {u64 aperf; u64 mperf; u64 tsc; + u64 target; u64 time; };@@ -1168,6 +1170,7 @@ static void intel_pstate_get_cpu_pstates(structcpudata *cpu) pstate_funcs.get_vid(cpu); intel_pstate_set_min_pstate(cpu); + cpu->sample.target = int_tofp(cpu->pstate.min_pstate); } static inline void intel_pstate_calc_avg_perf(struct cpudata *cpu)@@ -1301,8 +1304,10 @@ static inline int32_tget_target_pstate_use_performance(struct cpudata *cpu) static inline int32_t get_target_pstate_default(struct cpudata *cpu) { struct sample *sample = &cpu->sample; + int64_t scaled_gain, unfiltered_target; int32_t busy_frac; int pstate; + u64 duration_ns; busy_frac = div_fp(sample->mperf, sample->tsc); sample->busy_scaled = busy_frac * 100;@@ -1313,7 +1318,74 @@ static inline int32_tget_target_pstate_default(struct cpudata *cpu) cpu->iowait_boost >>= 1; pstate = cpu->pstate.turbo_pstate; - return fp_toint((pstate + (pstate >> 2)) * busy_frac); + /* To Do: I think the above should be: + * + * if (limits.no_turbo || limits.turbo_disabled) + * pstate = cpu->pstate.max_pstate; + * else + * pstate = cpu->pstate.turbo_pstate; + * + * figure it out. + * + * no clamps. Pre-filter clamping was needed in past implementations. + * To Do: Is any pre-filter clamping needed here? */ + + unfiltered_target = (pstate + (pstate >> 2)) * busy_frac; + + /* + * Idle check. + * We have a deferrable timer. Very long durations can be + * either due to long idle (C0 time near 0), + * or due to short idle times that spanned jiffy boundaries + * (C0 time not near zero). + * + * To Do: As of the utilization stuff, I do not think the + * spanning jiffy boundaries thing is true anymore. + * Check, and fix the comment. + * + * The very long durations are 0.4 seconds or more. + * Either way, a very long duration will effectively flush + * the IIR filter, otherwise falling edge load response times + * can be on the order of tens of seconds, because this driver + * runs very rarely. Furthermore, for higher periodic loads that + * just so happen to not be in the C0 state on jiffy boundaries, + * the long ago history should be forgotten. + * For cases of durations that are a few times the set sample + * period, increase the IIR filter gain so as to weight + * the current sample more appropriately. + * + * To Do: sample_time should be forced to be accurate. For + * example if the kernel is a 250 Hz kernel, then a + * sample_rate_ms of 10 should result in a sample_time of 12. + * + * To Do: Check that the IO Boost case is not filtered too much. + * It might be that a filter by-pass is needed for the boost case. + * However, the existing gain = f(duration) might be good enough. + */ + + duration_ns = cpu->sample.time - cpu->last_sample_time; + + scaled_gain = div_u64(int_tofp(duration_ns) * + int_tofp(pid_params.p_gain_pct), int_tofp(pid_params.sample_rate_ns)); + if (scaled_gain > int_tofp(100)) + scaled_gain = int_tofp(100); + /* + * This code should not be required, + * but short duration times have been observed + * To Do: Check if this code is actually still needed. I don't think so. + */ + if (scaled_gain < int_tofp(pid_params.p_gain_pct)) + scaled_gain = int_tofp(pid_params.p_gain_pct); + + /* + * Bandwidth limit the output. For now, re-task p_gain_pct for this purpose. + * Use a smple IIR (Infinite Impulse Response) filter. + */ + cpu->sample.target = div_u64((int_tofp(100) - scaled_gain) * + cpu->sample.target + scaled_gain * + unfiltered_target, int_tofp(100)); + + return fp_toint(cpu->sample.target + (1 << (FRAC_BITS-1))); } static inline void intel_pstate_update_pstate(struct cpudata *cpu, int pstate)@@ -1579,6 +1651,7 @@ static void intel_pstate_stop_cpu(structcpufreq_policy *policy) return; intel_pstate_set_min_pstate(cpu); + cpu->sample.target = int_tofp(cpu->pstate.min_pstate); } static int intel_pstate_cpu_init(struct cpufreq_policy *policy) The filter introduces a trade-off between step function load response time and the tendency for the target pstate to oscillate. ...[cut]...quoted
quoted
Several tests were done with this patch set. The patch set would not apply to kernel 4.7, but did apply fine to a 4.7+ kernel (I did as of 7a66ecf) from a few days ago. Test 1: Phoronix ffmpeg test (less time is better): Reason: Because it suffers from rotating amongst CPUs in an odd way, challenging for CPU frequency scaling drivers. This test tends to be an indicator of potential troubles with some games. Criteria: (Dirk Brandewie): Must match or better acpi_cpufreq - ondemand. With patch set: 15.8 Seconds average and 24.51 package watts. Without patch set: 11.61 Seconds average and 27.59 watts. Conclusion: Significant reduction in performance with proposed patch set.With the filter this become even worse at ~18 seconds. I used to fix this by moving the response curve to the left. I have not tested this: + unfiltered_target = (pstate + (pstate >> 1)) * busy_frac; which moves the response curve left a little, yet. I will test it. ...[cut]...quoted
quoted
Test 9: Doug's_specpower simulator (20% load): Time is fixed, less energy is better. Reason: During the long "[intel-pstate driver regression] processor frequency very high even if in idle" and subsequent https://bugzilla.kernel.org/show_bug.cgi?id=115771 discussion / thread(s), some sort of test was needed to try to mimic what Srinivas was getting on his fancy SpecPower test platform. So far at least, this test does that. Only the 20% load case was created, because that was the biggest problem case back then. With patch set: 4 tests at an average of 7197 Joules per test, relatively high CPU frequencies. Without the patch set: 4 tests at an average of 5956 Joules per test, relatively low CPU frequencies. Conclusion: 21% energy regression with the patch set. Note: Newer processors might do better than my older i7-2600K.Patch set + above and IIR gain = 10%: 5670 Joules. Conclusion: Energy regression eliminated. Other gains: gain = 5%: 5342 Joules; Busy MHz: 2172 gain = 10%: 5670 Joules; Busy MHz: 2285 gain = 20%: 6126 Joules; Busy MHz: 2560 gain = 30%: 6426 Joules; Busy MHz: 2739 gain = 40%: 6674 Joules; Busy MHz: 2912 gain = 70%: 7109 Joules; Busy MHz: 3199 locked at minimum pstate (reference): 4653 Joules; Busy MHz: 1600 Performance mode (reference): 7808 Joules; Busy MHz: 3647quoted
quoted
Test 10: measure the frequency response curve, fixed work packet method, 75 hertz work / sleep frequency (all CPU, no IOWAIT): Reason: To compare to some older data and observe overall. png graph NOT attached. Conclusions: Tends to oscillate, suggesting some sort of damping is needed. However, any filtering tends to increase the step function load rise time (see test 11 below, I think there is some wiggle room here). See also graph which has: with and without patch set; performance mode (for reference); Philippe Longepe's cpu_load method also with setpoint 40 (for reference); one of my previous attempts at a load related patch set from quite some time ago (for reference).As expected, the filter damps out the oscillation. New graphs will be sent to Rafael and Srinivas off-list.quoted
quoted
Test 11: Look at the step function load response. From no load to 100% on one CPU (CPU load only, no IO). While there is a graph, it is not attached: Conclusion: The step function response is greatly improved (virtually one sample time max). It would probably be O.K. to slow it down a little with a filter so as to reduce the tendency to oscillate under periodic load conditions (to a point, at least. A low enough frequency will always oscillate) (see the graph for test10).I haven't done this test yet, but from previous work, a gain setting of 10 to 15% gives a load step function response time similar to the current PID based filter. The other tests gave similar results with or without the filter. ... Doug -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html