TIF_NOHZ can escape nonhz mask? (Was: [PATCH v3 6/8] x86: Split... | linux-arm-kernel

TIF_NOHZ can escape nonhz mask? (Was: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases)

From: Frederic Weisbecker <hidden>
Date: 2014-07-31 18:47:39
Also in: linux-arch, linux-mips, lkml

On Thu, Jul 31, 2014 at 08:12:30PM +0200, Oleg Nesterov wrote:

On 07/31, Frederic Weisbecker wrote:

quoted

On Thu, Jul 31, 2014 at 06:03:53PM +0200, Oleg Nesterov wrote:

quoted

On 07/31, Frederic Weisbecker wrote:

quoted

I was convinced that the very first kernel init task is PID 0 then
it forks on rest_init() to launch the userspace init with PID 1. Then init/0 becomes the
idle task of the boot CPU.

Yes sure. But context_tracking_cpu_set() is called by init task with PID 1, not
by "swapper".

Are you sure? It's called from start_kernel() which is init/0.

But do_initcalls() is called by kernel_init(), this is the init process which is
going to exec /sbin/init later.

But this doesn't really matter,

Yeah but tick_nohz_init() is not an initcall, it's a function called from start_kernel(),
before initcalls.

quoted

Maybe we should just enable it everywhere.

Yes, yes, this doesn't really matter. We can even add set(TIF_NOHZ) at the start
of start_kernel(). The question is, I still can't understand why do we want to
have the global TIF_NOHZ.

Because then the flags is inherited in forks. It's better than inheriting it on
context switch due to context switch being called much more often than fork.

quoted

OK. To simplify, lets discuss user_enter() only. So, it is actually a nop on
CPU_0, and CPU_1 can miss it anyway.

quoted

But global TIF_NOHZ should enforce context tracking everywhere.

And this is what I can't understand. Lets return to my initial question, why
we can't change __context_tracking_task_switch()

	void __context_tracking_task_switch(struct task_struct *prev,
					    struct task_struct *next)
	{
		if (context_tracking_cpu_is_enabled())
			set_tsk_thread_flag(next, TIF_NOHZ);
		else
			clear_tsk_thread_flag(next, TIF_NOHZ);
	}

?

Well we can change it to global TIF_NOHZ

quoted

How can the global TIF_NOHZ help?

It avoids that flag swap on task_switch.

Ah, you probably meant that we can kill context_tracking_task_switch() as
we discussed.

Yeah.

But I meant another thing, TIF_NOHZ is already global because it is always
set after context_tracking_cpu_set().

Performance-wise, this set/clear code above can be better because it avoids
the slow paths on the untracked CPU's.

But all CPUs are tracked when context tracking is enabled. So there is no such
per CPU granularity.

But let's ignore this, the question is
why the change above is not correct? Or why it can make the things worse?

Which change above?

quoted

OK, OK, a task can return to usermode on CPU_0, notice TIF_NOHZ, take the
slow path, and do the "right" thing if it migrates to CPU_1 _before_ it
comes to user_enter(). But this case is very unlikely, certainly this can't
explain why do we penalize the untracked CPU's ?

It's rather that CPU 0 calls user_enter() and then migrate to CPU 1 and resume
to userspace.

And in this case a) user_enter() is pointless on CPU_0, and b) CPU_1 misses
the necessary user_enter().

No, user_enter() is necessary on CPU 0 so that CPU 1 sees that it is in userspace
from context tracking POV.

quoted

It's unlikely but possible. I actually observed that very easily on early testing.

Sure. And this can happen anyway? Why the change in __context_tracking_task_switch()
is wrong?

Which change?

quoted

And it's a big problem because then the CPU runs in userspace, possibly for a long while
in HPC case, and context tracking thinks it is in kernelspace. As a result, RCU waits
for that CPU to complete grace periods and cputime is accounted to kernelspace instead of
userspace.

It looks like a harmless race but it can have big consequences.

I see. Again, does the global TIF_NOHZ really help?

Yes, to remove the context switch overhead. But it doesn't change anything on those races.

quoted

Because calling context_switch_task_switch() on every context switch is costly.

See above. This depends, but forcing the slow paths on all CPU's can be more
costly.

Yeah but I don't have a safe solution that avoids that.

quoted

And of course I can't understand exception_enter/exit(). Not to mention that
(afaics) "prev_ctx == IN_USER" in exception_exit() can be false positive even
if we forget that the caller can migrate in between. Just because, once again,
a tracked CPU can miss user_exit().

You lost me on this. How can a tracked CPU miss user_exit()?

I am lost too ;) Didn't we already discuss this? A task calls user_exit() on
CPU_0 for no reason, migrates to the tracked CPU_1 and finally returns to user
space leaving this cpu in IN_KERNEL state?

Yeah, so? :)

I'm pretty sure there is a small but important element here that makes us misunderstanding
what each others says. It's like we aren't taking about the same thing :)

quoted

That's how I implemented it first. But then I changed it the way it is now:
6c1e0256fad84a843d915414e4b5973b7443d48d
("context_tracking: Restore correct previous context state on exception exit")

This is again due to the shift between hard and soft userspace boundaries.
user_mode(regs) checks hard boundaries only.

Lets get back to our beloved example:

          CPU 0                                  CPU 1
          ---------------------------------------------

          returning from syscall {
               user_enter();
               exception {
                    exception_enter()
                    PREEMPT!
                    ----------------------->
                                                 //resume exception
                                                   exception_exit();
                                                   return to userspace

OK, thanks, so in this case we miss user_enter().

But again, we can miss it anyway if preemptions comes before "exception" ?
if CPU 1 was in IN_KERNEL state.

No, because preempt_schedule_irq() does the ctx_state save and restore with
exception_enter/exception_exit.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help