Thread (10 messages) 10 messages, 2 authors, 2021-01-30

Re: [PATCH] thermal/intel: introduce tcc cooling driver

From: Zhang Rui <rui.zhang@intel.com>
Date: 2021-01-28 17:32:19

Hi, Doug,

On Tue, 2021-01-26 at 11:18 -0800, Doug Smythies wrote:
Hi, Just a small follow up on this one:

On 2021.01.16 09:08 Doug Smythies wrote:
quoted
On 2021.01.15 Zhang Rui wrote:
...
quoted
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt
67.93   3773    134577  43      54.78

For unknown reason the processor seems to now
think it is not heavily loaded. From my MSR decoder:

0x64F: MSR_CORE_PERF_LIMIT_REASONS: 200020 AUTO AUTOL

From the book:
quoted
Autonomous Utilization-Based Frequency Control
Status (R0)
When set, frequency is reduced below the operating
system request because the processor has detected
that utilization is low.
Which is not true.

Anyway,

Acked-by: Doug Smythies <redacted>
O.K. there were 2 things wrong above:

1.) I used the wrong intel SDM table for those bit definitions.
They should have been: RATL and RATLL.

From the proper page of the book:
quoted
Running Average Thermal Limit Status (R0)
When set, frequency is reduced below the operating
system request due to Running Average Thermal Limit
(RATL).
2.) Due to the already discussed turbostat issue, that was not
the actual temperature and so the RATL bit being set was actually
valid at that time.
On my side, I got the "Thermal status bit" set.
I have not been able to find the time window knob for this, if there
even is one, similar to the time window knobs for the package power
limits.
I wanted to reduce the time constant, just as a test, in an attempt
to reduce the step function load potential temperature overshoot.
One additional informational follow up note:

There always seems to be a significant turn on transient to using the
TCC offset, appearing as temperature undershoot. I am saying that
an offset of 0 seems to also act as some sort of on/off switch to the
running average.

Example 1 - start with offset 0:

$ sudo ~/turbostat --Summary --quiet --show
Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 1
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt
51.17   4600    3531    71      93.57
51.37   4600    3543    71      93.60
51.37   4600    3590    71      93.63  <<< offset changed from 0 to
24
50.99   3737    3566    52      43.49  <<< trip point = 76 degrees
51.20   3700    3550    51      41.14  <<< TCC offset turn on
transient
51.09   3700    3559    51      41.30  <<< There was no need to
throttle
51.12   3779    3515    53      43.78
50.95   4064    3553    58      55.57
51.55   4271    3522    63      65.30
51.18   4424    3534    67      76.58
51.27   4500    3532    68      84.12
51.14   4500    3529    68      84.14
51.24   4599    3522    71      93.61
51.14   4600    3523    71      93.71  <<< Eventually it does return
to not throttled.
Example 2 - start with offset 1:

Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt
51.14   4600    3554    73      94.73
51.37   4600    3544    73      94.85
51.03   4600    3560    74      94.64 <<< offset changed from 1 to 24
51.33   4600    3508    73      94.88 <<< trip point = 76 degrees
51.14   4600    3526    73      94.69 <<< No TCC offset transient
51.22   4600    3614    73      94.85
51.24   4600    3531    73      94.84
51.50   4600    3578    73      94.92
51.15   4600    3571    73      94.77
51.20   4600    3521    73      94.91
51.19   4600    3550    73      94.76
51.27   4600    3522    74      94.81
51.27   4600    3530    74      94.98
Thanks for your test.
I'd prefer this is platform specific. 
Because it behaves really differently from what I observed.

$sudo turbostat --Summary --quiet --show
Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 1
99.45	2216	10656	80	14.81  <<< start with offset=0
99.48	2234	10621	79	15.02
99.47	2233	10436	80	14.96
99.45	2236	10587	79	14.94
99.49	2216	10673	79	15.04
99.46	2226	10685	79	14.87
99.43	2233	10776	79	14.89
99.73	399	9139	66	4.51   <<< offset set to 50
99.76	212	8998	65	3.31
99.77	212	8902	64	3.27
...                                    <<< throttled for 20 seconds
99.76	212	8911	55	2.97
99.77	211	8851	55	2.95
99.76	211	8916	55	2.94
99.77	211	8844	55	3.05
99.77	211	8828	54	3.21
99.77	211	8911	54	3.05
99.74	212	8998	54	3.20
99.77	212	8802	54	2.90
99.77	211	8849	54	2.90
99.76	212	8942	53	2.98
99.76	211	9039	53	3.22
99.74	212	8977	53	2.89
99.77	211	8913	53	2.89
99.76	212	8900	53	2.89
99.77	211	8817	52	2.87
99.77	212	8923	52	2.88
99.77	212	8985	52	2.88
99.73	212	8877	52	2.86
99.58	575	9308	66	5.54    <<< offset set to 32
98.92	2460	13694	66	17.32
98.98	2298	13768	66	15.24
99.03	2244	14652	66	14.48
98.97	2198	14489	66	13.95
99.03	2148	14583	66	13.43
99.02	2107	14093	66	13.45
99.06	2060	13750	66	12.61
99.06	2036	14195	66	12.27
99.07	2007	14240	66	12.07   
99.12	2888	12147	98	28.23   <<< offset cleared
99.03	3413	11503	98	37.21
98.96	3317	11698	98	34.64
99.07	3246	11410	98	32.89
98.95	3210	12107	98	32.13
98.94	3164	11790	98	31.08
99.00	3124	12106	98	30.84
99.00	3086	11876	98	29.60
98.94	3054	12482	98	29.00
98.89	3030	12629	98	28.54
99.39	2377	10764	82	17.62   <<< Didn't do anything, so it
is probably thermald or something 
99.49	2200	10679	81	14.44
99.52	2211	10267	80	14.66
99.49	2221	10318	80	14.71
99.45	2220	10289	81	14.74
99.43	2222	10326	81	14.76

I tried both tests, and the results are the same, in both cases, it
starts throttling immediately (within a second), and no over-throttling 
observed.

Do you have a script to do this? Say, run turbostat in background and
then change tcc offset at certain timestamp? Maybe we can try exactly
the same test on different machines.

thanks,
rui
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help