[PATCHv3 0/5] coupled cpuidle state support

From: Rafael J. Wysocki <hidden>
Date: 2012-05-03 20:39:11
Also in: linux-pm, lkml

On Thursday, May 03, 2012, Colin Cross wrote:

On Thu, May 3, 2012 at 1:00 PM, Rafael J. Wysocki [off-list ref] wrote:
<snip>

quoted

There are two distinct cases to consider here, (1) when the last I/O
device in the domain becomes idle and the question is whether or not to
power off the entire domain and (2) when a CPU core in a power domain
becomes idle while all of the devices in the domain are idle already.

Case (2) is quite straightforward: the .enter() routine for the
"domain" C-state has to check whether the domain can be turned off and,
if so, do it.

Case (1) is more difficult and (assuming that all CPU cores in the domain
are already idle at this point) I see two possible ways to handle it:
(a) Wake up all of the (idle) CPU cores in the domain and let the
 "domain" C-state's .enter() do the job (i.e. turn it into case (2)),
 similarly to your patchset.
(b) If cpuidle has prepared the cores for going into deeper idle,
 turn the domain off directly without waking up the cores.

Multiple clusters is a design that has been considered in this
patchset (all the data structures are in the right place to support
it), and can be supported in the future, but does not exist in any
current systems that would be using this.  In all of today's SoCs,
there is a single cluster, so (1) can't happen - no code can be
executing while all cpus are idle.

OK, but I think it should be taken into consideration nonetheless.

(b) is an optimization that would not be possible on any future SoC
that is similar to the current SoCs, where "turn the domain off" is
very tightly integrated with TrustZone secure code running on the
primary cpu of the cluster.

I see.

<snip>

quoted

Having considered this for a while I think that it may be more straightforward
to avoid waking up the already idled cores.

For instance, say we have 4 CPU cores in a cluster (package) such that each
core has its own idle state (call it C1) and there is a multicore idle state
entered by turning off the entire cluster (call this state C-multi).  One of
the possible ways to handle this seems to be to use an identical table of
C-states for each core containing the C1 entry and a kind of fake entry called
(for example) C4 with the time characteristics of C-multi and a special
.enter() callback.  That callback will prepare the core it is called for to
enter C-multi, but instead of simply turning off the whole package it will
decrement a counter.  If the counter happens to be 0 at this point, the
package will be turned off.  Otherwise, the core will be put into the idle
state corresponding to C1, but it will be ready for entering C-multi at
any time. The counter will be incremented on exiting the C4 "state".
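
A minimal sketch of the counter bookkeeping described above, written as a
single-threaded user-space simulation. The names (struct cluster, c4_enter,
c4_exit) are hypothetical and not part of cpuidle; a real implementation
would need atomic operations or locking around the counter, and the actual
save/restore and power-off steps are only indicated by comments.

```c
#include <assert.h>

enum idle_result { ENTERED_C1, ENTERED_C_MULTI };

/* Hypothetical per-cluster bookkeeping; not a kernel structure. */
struct cluster {
	int not_ready;   /* cores that have not yet prepared for C-multi */
	int powered_off; /* 1 while the whole cluster is turned off */
};

static void cluster_init(struct cluster *cl, int ncores)
{
	cl->not_ready = ncores;
	cl->powered_off = 0;
}

/*
 * Fake "C4" .enter(): prepare the calling core for C-multi and
 * decrement the counter.  The core that brings it to 0 turns the
 * cluster off; every other core just sits in C1.
 */
static enum idle_result c4_enter(struct cluster *cl)
{
	assert(cl->not_ready > 0);
	/* Core state needed for C-multi would be saved here. */
	if (--cl->not_ready == 0) {
		cl->powered_off = 1;	/* stand-in for the real power-off */
		return ENTERED_C_MULTI;
	}
	return ENTERED_C1;		/* stand-in for entering C1 */
}

/* Exit from "C4": the cluster is up again, the core is no longer ready. */
static void c4_exit(struct cluster *cl)
{
	cl->powered_off = 0;
	cl->not_ready++;
}
```

With four cores, the first three calls to c4_enter() return ENTERED_C1 and
the fourth returns ENTERED_C_MULTI. The sketch lets whichever core arrives
last trigger the power-off.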

I implemented something very similar to this on Tegra2 (having each
cpu go to C1, but with enough state saved for C-multi), but it turns
out not to work in hardware.  On every existing ARM SMP system where I
have worked with cpuidle (Tegra2, OMAP4, Exynos5, and some Tegra3),
only cpu 0 can trigger the transition to C-multi.  The cause of this
restriction is different on every platform - sometimes it's by design,
sometimes it's a bug in the SoC ROM code, but the restriction exists.
The primary cpu of the cluster always needs to be awake.

OK, so that means we need to do the wakeup for technical reasons.

In addition, it may not be possible to transition secondary cpus from
C1 to C-multi without waking them.  That would generally involve
cutting power to a CPU that is in clock gating, which is not a
supported power transition in any SoC that I have a datasheet for.  I
made it work for cpu1 on Tegra2, but I can't guarantee that there are
not unsolvable HW race conditions.

The only generic way to make this work is to wake up all cpus.  Waking
up a subset of cpus is certainly worth investigating as an
optimization, but it would not be used on Tegra2, OMAP4, or Exynos5.
Tegra3 may benefit from it.

OK

quoted

It looks like this should work without modifying the cpuidle core, but
the drawback here is that the cpuidle core doesn't know how much time
spent in C4 is really in C1 and how much of it is in C-multi, so the
statistics reported by it won't reflect the real energy usage.
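
To illustrate the accounting gap, here is a rough sketch of how one C4
residency could be split into its C1 and C-multi parts if the time at which
the cluster was actually powered off were known. The structure and function
names are hypothetical, and the timestamps are plain microsecond counts
supplied by the caller; nothing here is an existing cpuidle interface.

```c
#include <assert.h>

/* Hypothetical per-core counters for the fake "C4" state. */
struct c4_stats {
	unsigned long long c1_us;	/* time in C4 really spent in C1 */
	unsigned long long c_multi_us;	/* time in C4 with the cluster off */
};

/*
 * Account one C4 residency.
 *   enter_us: when the core entered C4
 *   off_us:   when the cluster was powered off; pass exit_us if it
 *             never was during this residency
 *   exit_us:  when the core left C4
 */
static void c4_account(struct c4_stats *st, unsigned long long enter_us,
		       unsigned long long off_us, unsigned long long exit_us)
{
	assert(enter_us <= exit_us);
	/* Clamp the power-off time into the residency window. */
	if (off_us < enter_us)
		off_us = enter_us;
	if (off_us > exit_us)
		off_us = exit_us;
	st->c1_us += off_us - enter_us;
	st->c_multi_us += exit_us - off_us;
}
```

Without something along these lines, the whole interval from enter_us to
exit_us would be reported as a single C4 residency.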

Idle statistics are extremely important when determining why a
particular use case is drawing too much power, and it is worth
modifying the cpuidle core if only to keep them accurate, especially
when justifying the move from the cpufreq hotplug governor based code
that every SoC vendor uses in their BSP to a proper multi-CPU cpuidle
implementation.

I see.

Thanks for the explanation,
Rafael
