Re: [PATCH] cpuidle/pseries: Fixup CEDE0 latency only for POWER10 onwards

From: Michal Suchánek <hidden>
Date: 2021-04-25 11:07:20
Also in: linux-pm

On Sat, Apr 24, 2021 at 01:07:16PM +0530, Vaidyanathan Srinivasan wrote:

* Michal Such?nek [off-list ref] [2021-04-23 20:42:16]:

quoted

On Fri, Apr 23, 2021 at 11:59:30PM +0530, Vaidyanathan Srinivasan wrote:

quoted

* Michal Such?nek [off-list ref] [2021-04-23 19:45:05]:

quoted

On Fri, Apr 23, 2021 at 09:29:39PM +0530, Vaidyanathan Srinivasan wrote:

quoted

* Michal Such?nek [off-list ref] [2021-04-23 09:35:51]:

quoted

On Thu, Apr 22, 2021 at 08:37:29PM +0530, Gautham R. Shenoy wrote:

quoted

From: "Gautham R. Shenoy" <redacted>

Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
of the Extended CEDE states advertised by the platform

On some of the POWER9 LPARs, the older firmwares advertise a very low
value of 2us for CEDE1 exit latency on a Dedicated LPAR. However the

Can you be more specific about 'older firmwares'?

Hi Michal,

This is POWER9 vs POWER10 difference, not really an obsolete FW.  The
key idea behind the original patch was to make the H_CEDE latency and
hence target residency come from firmware instead of being decided by
the kernel.  The advantage is such that, different type of systems in
POWER10 generation can adjust this value and have an optimal H_CEDE
entry criteria which balances good single thread performance and
wakeup latency.  Further we can have additional H_CEDE state to feed
into the cpuidle.

So all POWER9 machines are affected by the firmware bug where firmware
reports CEDE1 exit latency of 2us and the real latency is 5us which
causes the kernel to prefer CEDE1 too much when relying on the values
supplied by the firmware. It is not about 'older firmware'.

Correct.  All POWER9 systems running Linux as guest LPARs will see
extra usage of CEDE idle state, but not baremetal (PowerNV).

The correct definition of the bug or miss-match in expectation is that
firmware reports wakeup latency from a core/thread wakeup timing, but
not end-to-end time from sending a wakeup event like an IPI using
H_calls and receiving the events on the target.  Practically there are
few extra micro-seconds needed after deciding to wakeup a target
core/thread to getting the target to start executing instructions
within the LPAR instance.

Thanks for the detailed explanation.

Maybe just adding a few microseconds to the reported time would be a
more reasonable workaround than using a blanket fixed value then.

Yes, that is an option.  But that may only reduce the difference
between existing kernel and new kernel unless we make it the same
number.  Further we are fixing this in P10 and hence we will have to
add "if(P9) do the compensation" and otherwise take it as is.  That
would not be elegant.  Given that our goal for P9 platform is to not
introduce changes in H_CEDE entry behaviour, we arrived at this
approach (this small patch) and this also makes it easy to backport to
various distro products.

I don't see how this is more elegent.

The current patch is

if(p9)
	use fixed value

the suggested patch is

if(p9)
	apply compensation

That is either will add one branch for the affected platform.

But I understand if you do not have confidence that the compensation is
the same in all cases and do not have the opportunity to measure it it
may be simpler to apply one very conservative adjustment.

Thanks

Michal

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help