Re: [BUG nohz]: wrong user and system time accounting

From: Luiz Capitulino <hidden>
Date: 2017-03-28 13:53:09
Also in: lkml

On Tue, 28 Mar 2017 13:28:13 +0800
Wanpeng Li [off-list ref] wrote:

2017-03-28 2:38 GMT+08:00 Luiz Capitulino [off-list ref]:

quoted

On Mon, 27 Mar 2017 09:56:47 +0800
Wanpeng Li [off-list ref] wrote:

quoted

Actually after I bisect, the first bad commit is ff9a9b4c4334 ("sched,
time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity"). The bug
can be reproduced readily if CONFIG_CONTEXT_TRACKING_FORCE is true,
then just stress all the online cpus or just one cpu and leave others
idle(so it stresses the global timekeeping one), top show 100%
sys-time. And another way to reproduce it is by nohz_full, and gives
the stress to the house keeping cpu, the top show 100% sys-time of the
house keeping cpu, and also the other cpus who have at least two tasks
running on and in full_nohz mode.

We're not short on reproducers, I have a new one too:

 http://people.redhat.com/~lcapitul/real-time/acct-bug.c

This is a single threaded task that reproduces the issue. If you
run it as instructed, you'll get:

 - nohz_full CPU: 95% system time 5% idle time
 - non-nohz_full CPU: 95% user time 5% idle time (expected behavior)

This reproduces the issue, but not for the reasons I expected. I was
trying to mimic what I was seeing on my trace when tracing the two
task problem. Which is: a task stays 995us in user-space and then
enters the kernel. Time won't be accounted for user-space because
we're not 1 jiffies yet, but if the task stays in the kernel for more
than 5us, then time will be accounted for system time when going
back to user-space.

However, what really seems to be happening is: acct-bug is causing
the tick to be re-activated (why? it shouldn't) and that causes the
issue to appear. This is consistent with my other observations: I
can only reproduce the issue if the nohz_full CPU re-activates the tick.

I see there are other kthreads like migration, kworker,
torture_shuffle etc on the isolated CPU.

Except for torture_shuffle (which is new to me, and I guess could
be disabled in .config) the other threads should not be runnable
for most of the time.

Regards,
Wanpeng Li

quoted

Let's consider the cpu which has responsibility for the global
timekeeping, as the tracing posted above, the vtime_account_user() is
called before tick_sched_timer() which will update jiffies,

But the vtime_account_user() call and the jiffies update happen
on different CPUs, no? So the ordering shouldn't matter.

quoted

so jiffies
is stale in vtime_account_user() and the run time in userspace is
skipped, the vtime_user_enter() is called after jiffies update, so
both the time in userspace and in  kernel are accumulated to sys time.

If the housekeeping cpu is idle when CONFIG_NO_HZ_FULL, everything is
fine. However, if you give stress to the housekeeping cpu, top will
show 100% sys-time of both the housekeeping cpu and the other cpus who
have at least two tasks running on and in full_nohz mode.

The housekeeping CPUs are idle with my reproducers.

quoted

I think it
is because the stress delays the timer interrupt handling in some
degree, then the jiffies is not updated timely before other cpus
access it in vtime_account_user().

I think we can keep syscalls/exceptions context tracking still in
jiffies based sampling and utilize local_clock() in vtime_delta()
again for irqs which avoids jiffies stale influence. I can make a
patch if the idea is acceptable or there is any better proposal. :)

Regards,
Wanpeng Li

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help