[linux-pm] [PATCHv3 3/5] cpuidle: add support for states that affect multiple cpus

From: Rafael J. Wysocki <hidden>
Date: 2012-05-04 22:23:09
Also in: linux-pm, lkml

On Friday, May 04, 2012, Colin Cross wrote:

On Fri, May 4, 2012 at 4:51 AM, Rafael J. Wysocki [off-list ref] wrote:

quoted

On Friday, May 04, 2012, Colin Cross wrote:

quoted

On Thu, May 3, 2012 at 3:14 PM, Rafael J. Wysocki [off-list ref] wrote:

[...]

quoted

+/**
+ * cpuidle_coupled_cpus_waiting - check if all cpus in a coupled set are waiting
+ * @coupled: the struct coupled that contains the current cpu
+ *
+ * Returns true if all cpus coupled to this target state are in the wait loop
+ */
+static inline bool cpuidle_coupled_cpus_waiting(struct cpuidle_coupled *coupled)
+{
+     int alive;
+     int waiting;
+
+     /*
+      * Read alive before reading waiting so a booting cpu is not treated as
+      * idle
+      */

Well, the comment doesn't really explain much.  In particular, why the boot CPU
could be treated as idle if the reads were in a different order.

Hm, I think the race condition is on a cpu going down.  What about:
Read alive before reading waiting.  If waiting is read before alive,
this cpu could see another cpu as waiting just before it goes offline,
between when it the other cpu decrements waiting and when it
decrements alive, which could cause alive == waiting when one cpu is
not waiting.

Reading them in this particular order doesn't stop the race, though.  I mean,
if the hotplug happens just right after you've read alive_count, you still have
a wrong value.  waiting_count is set independently, it seems, so there's no
ordering between the two on the "store" side and the "load" side ordering
doesn't matter.

As commented in the hotplug path, hotplug relies on the fact that one
of the cpus in the cluster is involved in the hotplug of the cpu that
is changing (this may not be true for multiple clusters, but it is
easy to fix by IPI-ing to a cpu that is in the same cluster when that
happens).

That's very fragile and potentially sets a trap for people trying to make
the kernel work on systems with multiple clusters.

That means that waiting count is always guaranteed to be at
least 1 less than alive count when alive count changes.  All this read
ordering needs to do is make sure that this cpu doesn't see
waiting_count == alive_count by reading them in the wrong order.

So, the concern seems to be that if the local CPU reorders the reads
from waiting_count and alive_count and enough time elapses between one
read and the other, the decrementation of waiting_count may happen
between them and then the CPU may use the outdated value for comparison,
right?

Still, though, even if the barrier is there, the modification of
alive_count in the hotplug notifier routine may not happen before
the read from alive_count in cpuidle_coupled_cpus_waiting() is completed.
Isn't that a problem?

quoted

I would just make the CPU hotplug notifier routine block until
cpuidle_enter_state_coupled() is done and the latter return immediately
if the CPU hotplug notifier routine is in progress, perhaps falling back
to the safe state.  Or I would make the CPU hotplug notifier routine
disable the "coupled cpuidle" entirely on DOWN_PREPARE and UP_PREPARE
and only re-enable it after the hotplug has been completed.

I'll take a look at disabling coupled idle completely during hotplug.

Great, thanks!

quoted

+     alive = atomic_read(&coupled->alive_count);
+     smp_rmb();
+     waiting = atomic_read(&coupled->waiting_count);

Have you considered using one atomic variable to accommodate both counters
such that the upper half contains one counter and the lower half contains
the other?

There are 3 counters (alive, waiting, and ready).  Do you want me to
squish all of them into a single atomic_t, which would limit to 1023
cpus?

No.  I'd make sure that cpuidle_enter_state_coupled() did't race with CPU
hotplug, so as to make alive_count stable from its standpoint, and I'd
put the two remaining counters into one atomic_t variable.

I'll take a look at using a single atomic_t.  My initial worry was
that the increased contention on the shared variable would cause more
cmpxchg retries, but since waiting_count and ready_count are designed
to be modified in sequential phases that shouldn't be an issue.

[...]

quoted

+     while (!need_resched() && !cpuidle_coupled_cpus_waiting(coupled)) {
+             entered_state = cpuidle_enter_state(dev, drv,
+                     dev->safe_state_index);
+
+             local_irq_enable();
+             while (cpumask_test_cpu(dev->cpu, &cpuidle_coupled_poked_mask))
+                     cpu_relax();

Hmm.  What exactly is this loop supposed to achieve?

This is to ensure that the outstanding wakeups have been processed so
we don't go to idle with an interrupt pending an immediately wake up.

I see.  Is it actually safe to reenable interrupts at this point, though?

I think so.  The normal idle loop will enable interrupts in a similar
fashion to what happens here.  There are two things to worry about: a
processed interrupt causing work to be scheduled that should bring
this cpu out of idle, or changing the next timer which would
invalidate the current requested state.  The first is handled by
checking need_resched() after interrupts are disabled again, the
second is currently unhandled but does not affect correct operation,
it just races into a less-than-optimal idle state.

I see.

quoted

+             local_irq_disable();

Anyway, you seem to be calling it twice along with this enabling/disabling of
interrupts.  I'd put that into a separate function and explain its role in a
kerneldoc comment.

I left it here to be obvious that I was enabling interrupts in the
idle path, but I can refactor it out if you prefer.

Well, you can call the function to make it obvious. :-)

Anyway, I think that code duplication is a worse thing than a reasonable
amount of non-obviousness, so to speak.

quoted

+     }
+
+     /* give a chance to process any remaining pokes */
+     local_irq_enable();
+     while (cpumask_test_cpu(dev->cpu, &cpuidle_coupled_poked_mask))
+             cpu_relax();
+     local_irq_disable();
+
+     if (need_resched()) {
+             cpuidle_coupled_set_not_waiting(dev, coupled);
+             goto out;
+     }
+
+     /*
+      * All coupled cpus are probably idle.  There is a small chance that
+      * one of the other cpus just became active.  Increment a counter when
+      * ready, and spin until all coupled cpus have incremented the counter.
+      * Once a cpu has incremented the counter, it cannot abort idle and must
+      * spin until either the count has hit alive_count, or another cpu
+      * leaves idle.
+      */
+
+     smp_mb__before_atomic_inc();
+     atomic_inc(&coupled->ready_count);
+     smp_mb__after_atomic_inc();

It seems that at least one of these barriers is unnecessary ...

The first is to ensure ordering between ready_count and waiting count,

Are you afraid that the test against waiting_count from
cpuidle_coupled_cpus_waiting() may get reordered after the incrementation
of ready_count or is it something else?

Yes, ready_count must not be incremented before waiting_count == alive_count.

Well, control doesn't reach the atomic_inc() statement if this condition
is not satisfied, so I don't see how it can be possibly reordered before
the while () loop without breaking the control flow guarantees.

quoted

the second is for ready_count vs. alive_count and requested_state.

This one I can understand, but ...

quoted

+     /* alive_count can't change while ready_count > 0 */
+     alive = atomic_read(&coupled->alive_count);

What happens if CPU hotplug happens right here?

According to the comment above that line that can't happen -
alive_count can't change while ready_count > 0, because that implies
that all cpus are waiting and none can be in the hotplug path where
alive_count is changed.  Looking at it again that is not entirely
true, alive_count could change on systems with >2 cpus, but I think it
can't cause an issue because alive_count would be 2 greater than
waiting_count before alive_count was changed.  Either way, it will be
fixed by disabling coupled idle during hotplug.

Yup.

quoted

+     while (atomic_read(&coupled->ready_count) != alive) {
+             /* Check if any other cpus bailed out of idle. */
+             if (!cpuidle_coupled_cpus_waiting(coupled)) {
+                     atomic_dec(&coupled->ready_count);
+                     smp_mb__after_atomic_dec();

And the barrier here?  Even if the old value of ready_count leaks into
the while () loop after retry, that doesn't seem to matter.

All of these will be academic if ready_count and waiting_count share
an atomic_t.
waiting_count must not be decremented by exiting the while loop after
the retry label until ready_count is decremented here, but that is
also protected by the barrier in set_not_waiting.  One of them could
be dropped.

quoted

+                     goto retry;
+             }
+
+             cpu_relax();
+     }
+
+     /* all cpus have acked the coupled state */
+     smp_rmb();

What is the barrier here for?

This protects ready_count vs. requested_state.  It is already
implicitly protected by the atomic_inc_return in set_waiting, but I
thought it would be better to protect it explicitly here.  I think I
added the smp_mb__after_atomic_inc above later, which makes this one
superflous, so I'll drop it.

OK

quoted

+
+     next_state = cpuidle_coupled_get_state(dev, coupled);
+
+     entered_state = cpuidle_enter_state(dev, drv, next_state);
+
+     cpuidle_coupled_set_not_waiting(dev, coupled);
+     atomic_dec(&coupled->ready_count);
+     smp_mb__after_atomic_dec();
+
+out:
+     /*
+      * Normal cpuidle states are expected to return with irqs enabled.
+      * That leads to an inefficiency where a cpu receiving an interrupt
+      * that brings it out of idle will process that interrupt before
+      * exiting the idle enter function and decrementing ready_count.  All
+      * other cpus will need to spin waiting for the cpu that is processing
+      * the interrupt.  If the driver returns with interrupts disabled,
+      * all other cpus will loop back into the safe idle state instead of
+      * spinning, saving power.
+      *
+      * Calling local_irq_enable here allows coupled states to return with
+      * interrupts disabled, but won't cause problems for drivers that
+      * exit with interrupts enabled.
+      */
+     local_irq_enable();
+
+     /*
+      * Wait until all coupled cpus have exited idle.  There is no risk that
+      * a cpu exits and re-enters the ready state because this cpu has
+      * already decremented its waiting_count.
+      */
+     while (atomic_read(&coupled->ready_count) != 0)
+             cpu_relax();
+
+     smp_rmb();

And here?

This was to protect ready_count vs. looping back in and reading
alive_count.

Well, I'm lost. :-)

You've not modified anything after the previous smp_mb__after_atomic_dec(),
so what exactly is the reordering this is supposed to work against?

And while we're at it, I'm not quite sure what the things that the previous
smp_mb__after_atomic_dec() separates from each other are.

Instead of justifying all of these, let me try the combined atomic_t
trick and justify the (many fewer) remaining barriers.

OK, cool! :-)

Thanks,
Rafael

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help