Re: [PATCH] powerpc/pseries: Fix cpu hotplug
From: Sebastien Dugue <hidden>
Date: 2008-11-28 10:04:50
Hi Nathan, On Thu, 27 Nov 2008 18:14:33 -0600 Nathan Lynch [off-list ref] wrote:
Hi, I have some questions about this patch. Sebastien Dugue wrote:quoted
Currently, pseries_cpu_die() calls msleep() while polling RTAS for the status of the dying cpu. However if the cpu that is going down also happens to be the one doing the tick then we're hosed as the tick_do_timer_cpu 'baton' is only passed later on in tick_shutdown() when _cpu_down() does the CPU_DEAD notification. Therefore jiffies won't be updated anymore.I confess unfamiliarity with the tick/timer code, but this sounds like something that should be addressed earlier in the process of taking down a CPU.
Maybe you're right, at least the tick_do_timer_cpu should be changed earlier in the down process, but I'm not sure where we can do that.
quoted
This patch replaces that msleep() with a cpu_relax() to make sure we're not going to schedule at that point.This is a significant change in behavior. With the msleep(), we poll for at least five seconds before giving up; with the cpu_relax(), the period will almost certainly be much shorter and we're likely to give up too soon in some circumstances.
Right, I realized a bit late that that would indeed change the behaviour. On my test box (2 Power6) the msleep call is hit in ~10 % of the cases and only loop once, but that may not be the case for all the pSeries out there where a longer delay might be needed.
Could be addressed by using mdelay(), but...
Yep, would be better. I'm still wondering why the hang is not systematic when offlining the tick_do_timer_cpu, I must have missed something in my analysis and there might be a race somewhere I failed to identify.
It's just not clear to me how busy-waiting in the __cpu_die() path is a legitimate fix. Is sleeping in this path forbidden now?
In that case, I think it is if the cpu going down is also doing the tick.
(I notice at least native_cpu_die() in x86 does msleep(), btw.)
Right, as most other arches do, but the thing is that on those, you cannot offline CPU0, whereas on power (and maybe otheres too) you can.
As it can take several milliseconds for RTAS to report a CPU offline, and the maximum latency of the operation is unspecified, it seems inappropriate to tie up the waiting CPU this way.
Agreed.
quoted
With this patch my test box survives a 100k iterations hotplug stress test on _all_ cpus, whereas without it, it quickly dies after ~50 iterations.What is the failure (e.g. stack trace, kernel messages)?
No stack trace, no kernel messages, nothing :( When that happens, the cpus get stuck in idle and won't reschedule at all. I verified that the decrementer is still ticking, hrtimers are still running but regular timers are stuck. Changing the tick_do_timer_cpu manually under xmon resurects the box. Will have to look at this a bit more. Thanks, Sebastien.