[linux-pm] [PATCHv3 3/5] cpuidle: add support for states that affect multiple cpus
From: Rafael J. Wysocki <hidden>
Date: 2012-05-04 22:23:09
Also in:
linux-pm, lkml
On Friday, May 04, 2012, Colin Cross wrote:
On Fri, May 4, 2012 at 4:51 AM, Rafael J. Wysocki [off-list ref] wrote:quoted
On Friday, May 04, 2012, Colin Cross wrote:quoted
On Thu, May 3, 2012 at 3:14 PM, Rafael J. Wysocki [off-list ref] wrote:[...]quoted
quoted
quoted
+/** + * cpuidle_coupled_cpus_waiting - check if all cpus in a coupled set are waiting + * @coupled: the struct coupled that contains the current cpu + * + * Returns true if all cpus coupled to this target state are in the wait loop + */ +static inline bool cpuidle_coupled_cpus_waiting(struct cpuidle_coupled *coupled) +{ + int alive; + int waiting; + + /* + * Read alive before reading waiting so a booting cpu is not treated as + * idle + */Well, the comment doesn't really explain much. In particular, why the boot CPU could be treated as idle if the reads were in a different order.Hm, I think the race condition is on a cpu going down. What about: Read alive before reading waiting. If waiting is read before alive, this cpu could see another cpu as waiting just before it goes offline, between when it the other cpu decrements waiting and when it decrements alive, which could cause alive == waiting when one cpu is not waiting.Reading them in this particular order doesn't stop the race, though. I mean, if the hotplug happens just right after you've read alive_count, you still have a wrong value. waiting_count is set independently, it seems, so there's no ordering between the two on the "store" side and the "load" side ordering doesn't matter.As commented in the hotplug path, hotplug relies on the fact that one of the cpus in the cluster is involved in the hotplug of the cpu that is changing (this may not be true for multiple clusters, but it is easy to fix by IPI-ing to a cpu that is in the same cluster when that happens).
That's very fragile and potentially sets a trap for people trying to make the kernel work on systems with multiple clusters.
That means that waiting count is always guaranteed to be at least 1 less than alive count when alive count changes. All this read ordering needs to do is make sure that this cpu doesn't see waiting_count == alive_count by reading them in the wrong order.
So, the concern seems to be that if the local CPU reorders the reads from waiting_count and alive_count and enough time elapses between one read and the other, the decrementation of waiting_count may happen between them and then the CPU may use the outdated value for comparison, right? Still, though, even if the barrier is there, the modification of alive_count in the hotplug notifier routine may not happen before the read from alive_count in cpuidle_coupled_cpus_waiting() is completed. Isn't that a problem?
quoted
I would just make the CPU hotplug notifier routine block until cpuidle_enter_state_coupled() is done and the latter return immediately if the CPU hotplug notifier routine is in progress, perhaps falling back to the safe state. Or I would make the CPU hotplug notifier routine disable the "coupled cpuidle" entirely on DOWN_PREPARE and UP_PREPARE and only re-enable it after the hotplug has been completed.I'll take a look at disabling coupled idle completely during hotplug.
Great, thanks!
quoted
quoted
quoted
quoted
+ alive = atomic_read(&coupled->alive_count); + smp_rmb(); + waiting = atomic_read(&coupled->waiting_count);Have you considered using one atomic variable to accommodate both counters such that the upper half contains one counter and the lower half contains the other?There are 3 counters (alive, waiting, and ready). Do you want me to squish all of them into a single atomic_t, which would limit to 1023 cpus?No. I'd make sure that cpuidle_enter_state_coupled() did't race with CPU hotplug, so as to make alive_count stable from its standpoint, and I'd put the two remaining counters into one atomic_t variable.I'll take a look at using a single atomic_t. My initial worry was that the increased contention on the shared variable would cause more cmpxchg retries, but since waiting_count and ready_count are designed to be modified in sequential phases that shouldn't be an issue.
[...]
quoted
quoted
quoted
quoted
+ while (!need_resched() && !cpuidle_coupled_cpus_waiting(coupled)) { + entered_state = cpuidle_enter_state(dev, drv, + dev->safe_state_index); + + local_irq_enable(); + while (cpumask_test_cpu(dev->cpu, &cpuidle_coupled_poked_mask)) + cpu_relax();Hmm. What exactly is this loop supposed to achieve?This is to ensure that the outstanding wakeups have been processed so we don't go to idle with an interrupt pending an immediately wake up.I see. Is it actually safe to reenable interrupts at this point, though?I think so. The normal idle loop will enable interrupts in a similar fashion to what happens here. There are two things to worry about: a processed interrupt causing work to be scheduled that should bring this cpu out of idle, or changing the next timer which would invalidate the current requested state. The first is handled by checking need_resched() after interrupts are disabled again, the second is currently unhandled but does not affect correct operation, it just races into a less-than-optimal idle state.
I see.
quoted
quoted
quoted
quoted
+ local_irq_disable();Anyway, you seem to be calling it twice along with this enabling/disabling of interrupts. I'd put that into a separate function and explain its role in a kerneldoc comment.I left it here to be obvious that I was enabling interrupts in the idle path, but I can refactor it out if you prefer.Well, you can call the function to make it obvious. :-) Anyway, I think that code duplication is a worse thing than a reasonable amount of non-obviousness, so to speak.quoted
quoted
quoted
+ } + + /* give a chance to process any remaining pokes */ + local_irq_enable(); + while (cpumask_test_cpu(dev->cpu, &cpuidle_coupled_poked_mask)) + cpu_relax(); + local_irq_disable(); + + if (need_resched()) { + cpuidle_coupled_set_not_waiting(dev, coupled); + goto out; + } + + /* + * All coupled cpus are probably idle. There is a small chance that + * one of the other cpus just became active. Increment a counter when + * ready, and spin until all coupled cpus have incremented the counter. + * Once a cpu has incremented the counter, it cannot abort idle and must + * spin until either the count has hit alive_count, or another cpu + * leaves idle. + */ + + smp_mb__before_atomic_inc(); + atomic_inc(&coupled->ready_count); + smp_mb__after_atomic_inc();It seems that at least one of these barriers is unnecessary ...The first is to ensure ordering between ready_count and waiting count,Are you afraid that the test against waiting_count from cpuidle_coupled_cpus_waiting() may get reordered after the incrementation of ready_count or is it something else?Yes, ready_count must not be incremented before waiting_count == alive_count.
Well, control doesn't reach the atomic_inc() statement if this condition is not satisfied, so I don't see how it can be possibly reordered before the while () loop without breaking the control flow guarantees.
quoted
quoted
the second is for ready_count vs. alive_count and requested_state.This one I can understand, but ...quoted
quoted
quoted
+ /* alive_count can't change while ready_count > 0 */ + alive = atomic_read(&coupled->alive_count);What happens if CPU hotplug happens right here?According to the comment above that line that can't happen - alive_count can't change while ready_count > 0, because that implies that all cpus are waiting and none can be in the hotplug path where alive_count is changed. Looking at it again that is not entirely true, alive_count could change on systems with >2 cpus, but I think it can't cause an issue because alive_count would be 2 greater than waiting_count before alive_count was changed. Either way, it will be fixed by disabling coupled idle during hotplug.
Yup.
quoted
quoted
quoted
quoted
+ while (atomic_read(&coupled->ready_count) != alive) { + /* Check if any other cpus bailed out of idle. */ + if (!cpuidle_coupled_cpus_waiting(coupled)) { + atomic_dec(&coupled->ready_count); + smp_mb__after_atomic_dec();And the barrier here? Even if the old value of ready_count leaks into the while () loop after retry, that doesn't seem to matter.All of these will be academic if ready_count and waiting_count share an atomic_t. waiting_count must not be decremented by exiting the while loop after the retry label until ready_count is decremented here, but that is also protected by the barrier in set_not_waiting. One of them could be dropped.quoted
quoted
quoted
quoted
+ goto retry; + } + + cpu_relax(); + } + + /* all cpus have acked the coupled state */ + smp_rmb();What is the barrier here for?This protects ready_count vs. requested_state. It is already implicitly protected by the atomic_inc_return in set_waiting, but I thought it would be better to protect it explicitly here. I think I added the smp_mb__after_atomic_inc above later, which makes this one superflous, so I'll drop it.OKquoted
quoted
quoted
+ + next_state = cpuidle_coupled_get_state(dev, coupled); + + entered_state = cpuidle_enter_state(dev, drv, next_state); + + cpuidle_coupled_set_not_waiting(dev, coupled); + atomic_dec(&coupled->ready_count); + smp_mb__after_atomic_dec(); + +out: + /* + * Normal cpuidle states are expected to return with irqs enabled. + * That leads to an inefficiency where a cpu receiving an interrupt + * that brings it out of idle will process that interrupt before + * exiting the idle enter function and decrementing ready_count. All + * other cpus will need to spin waiting for the cpu that is processing + * the interrupt. If the driver returns with interrupts disabled, + * all other cpus will loop back into the safe idle state instead of + * spinning, saving power. + * + * Calling local_irq_enable here allows coupled states to return with + * interrupts disabled, but won't cause problems for drivers that + * exit with interrupts enabled. + */ + local_irq_enable(); + + /* + * Wait until all coupled cpus have exited idle. There is no risk that + * a cpu exits and re-enters the ready state because this cpu has + * already decremented its waiting_count. + */ + while (atomic_read(&coupled->ready_count) != 0) + cpu_relax(); + + smp_rmb();And here?This was to protect ready_count vs. looping back in and reading alive_count.Well, I'm lost. :-) You've not modified anything after the previous smp_mb__after_atomic_dec(), so what exactly is the reordering this is supposed to work against? And while we're at it, I'm not quite sure what the things that the previous smp_mb__after_atomic_dec() separates from each other are.Instead of justifying all of these, let me try the combined atomic_t trick and justify the (many fewer) remaining barriers.
OK, cool! :-) Thanks, Rafael