Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
From: Paul E. McKenney <hidden>
Date: 2016-08-11 22:29:17
Also in:
lkml
On Thu, Aug 11, 2016 at 10:40:02AM +0200, Peter Zijlstra wrote:
> On Thu, Aug 11, 2016 at 12:16:58AM +0200, Frederic Weisbecker wrote:
> > I had similar issues, this seems to happen when the tsc is considered
> > not reliable (which doesn't necessarily mean unstable. I think it has
> > to do with some x86 CPU feature flag).
>
> Right, as per the other email, in general we cannot know/assume the TSC
> to be working as intended :/
>
> > IIRC, this _has_ to execute on all online CPUs because every TSCs of
> > running CPUs are concerned.
>
> With modern Intel we could run it on one CPU per package I think, but at
> the same time, too much in NOHZ_FULL assumes the TSC is indeed sane so
> it doesn't make sense to me to keep the watchdog running, when it
> triggers it would also have to kill all NOHZ_FULL stuff, which would
> probably bring the entire machine down..

Well, you -could- force a very low priority CPU-bound task to run on all
nohz_full CPUs. Not necessarily a good idea, but a relatively
non-intrusive response to that particular error condition.

Thanx, Paul

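A minimal userspace sketch of the low-priority hog idea above, assuming Linux and Python 3. The helper names (parse_cpu_list, spin_at_idle_priority, start_hogs) are hypothetical and do not come from the thread, and the thread does not say whether a user task or a kernel thread was intended. The sketch parses a CPU list in the format the kernel uses for /sys/devices/system/cpu/nohz_full, then forks one child per CPU that pins itself there and switches to SCHED_IDLE before spinning.

```python
import os


def parse_cpu_list(text):
    """Parse a kernel CPU list such as '1-3,8' into a sorted list of ints."""
    text = text.strip()
    # Assumption: an empty mask may be reported as '' or '(null)'.
    if text in ("", "(null)"):
        return []
    cpus = set()
    for part in text.split(","):
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def spin_at_idle_priority(cpu):
    """Pin the calling process to `cpu`, drop to SCHED_IDLE, spin forever."""
    os.sched_setaffinity(0, {cpu})
    # SCHED_IDLE requires a static priority of 0.
    os.sched_setscheduler(0, os.SCHED_IDLE, os.sched_param(0))
    while True:
        pass


def start_hogs(cpus):
    """Fork one spinning child per CPU and return the child PIDs."""
    pids = []
    for cpu in cpus:
        pid = os.fork()
        if pid == 0:
            try:
                spin_at_idle_priority(cpu)
            finally:
                # Only reached if pinning or the policy change failed.
                os._exit(1)
        pids.append(pid)
    return pids
```

One way to use it would be start_hogs(parse_cpu_list(open("/sys/devices/system/cpu/nohz_full").read())), keeping the returned PIDs so the children can be killed again. Each child burns a full CPU whenever nothing else is runnable there, which is presumably part of why this is "not necessarily a good idea".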
> Arguably we should issue a boot time warning if NOHZ_FULL is configured
> and the TSC watchdog is running.
>
> > I personally override that with passing the tsc=reliable kernel
> > parameter. Of course use it at your own risk.
>
> Yes, that is (sadly) our only option. Manually assert our hardware is
> solid under the intended workload and then manually disabling the
> watchdog.
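
A minimal sketch, assuming Linux and Python 3, for checking whether a running system was booted with the tsc=reliable override discussed above and which clocksource it ended up using. The function names are hypothetical. The paths /proc/cmdline and /sys/devices/system/clocksource/clocksource0/ are the standard procfs and sysfs locations; a missing file is reported as None rather than treated as an error.

```python
from pathlib import Path

CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def has_tsc_reliable(cmdline):
    """Return True if a kernel command line string contains tsc=reliable."""
    return "tsc=reliable" in cmdline.split()


def read_text_or_none(path):
    """Return the stripped contents of `path`, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def report():
    """Collect the command-line override and the clocksource state."""
    cmdline = read_text_or_none("/proc/cmdline") or ""
    return {
        "tsc_reliable_on_cmdline": has_tsc_reliable(cmdline),
        "current_clocksource": read_text_or_none(
            CLOCKSOURCE_DIR / "current_clocksource"
        ),
        "available_clocksource": read_text_or_none(
            CLOCKSOURCE_DIR / "available_clocksource"
        ),
    }
```

Calling report() on the target machine shows whether the override is present and whether tsc is still the current clocksource. It says nothing about whether the hardware is actually sound under the intended workload, which remains the manual assertion described above.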