Re: [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy

From: Peter Zijlstra <peterz@infradead.org>
Date: 2017-08-02 16:05:32
Also in: lkml

On Wed, Aug 02, 2017 at 08:41:35AM -0700, Tejun Heo wrote:

> > Not entirely sure I follow, we currently only update the current cgroup
> > and its immediate parents, no? Or are you looking to only account into
> > the current cgroup and propagate into the parents on reading?
>
> Yeah, shifting the cost to the readers and being smart with
> propagation so that reading isn't O(nr_descendants) but
> O(nr_descendants_which_have_run_since_last_read).  That way, we can
> show the basic stats without taxing the hot paths with reasonable
> scalability.

Right, that would be good.
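
A minimal single-threaded sketch of the read-side propagation discussed
above, not the kernel's code. Every name here (struct group,
group_charge(), group_flush(), group_read()) is hypothetical, and the
per-CPU state and locking a real implementation would need are left out.
The hot path charges only the group that ran and links it, plus any
not-yet-linked ancestors, onto an "updated" list; a read walks only those
links, so its cost follows the number of groups that ran since the last
read rather than the total number of descendants.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-group state; not a kernel structure. */
struct group {
    struct group *parent;
    struct group *updated_children; /* children holding unpropagated deltas */
    struct group *updated_next;     /* link in the parent's updated list */
    int on_updated_list;
    uint64_t pending_ns;   /* charged or collected since this group's last flush */
    uint64_t to_parent_ns; /* already in total_ns, not yet handed to the parent */
    uint64_t total_ns;     /* subtree total as of this group's last flush */
};

/* Hot path: charge the group itself and mark the path towards the root. */
void group_charge(struct group *g, uint64_t delta_ns)
{
    assert(g != NULL);
    g->pending_ns += delta_ns;
    /* A linked group implies linked ancestors, so stop at the first one. */
    for (; g->parent && !g->on_updated_list; g = g->parent) {
        g->on_updated_list = 1;
        g->updated_next = g->parent->updated_children;
        g->parent->updated_children = g;
    }
}

/* Read path: fold in deltas from updated descendants only.
 * Returns the number of groups visited. */
unsigned int group_flush(struct group *g)
{
    unsigned int visited = 1;
    struct group *c;

    while ((c = g->updated_children) != NULL) {
        g->updated_children = c->updated_next;
        c->updated_next = NULL;
        c->on_updated_list = 0;
        visited += group_flush(c);
        g->pending_ns += c->to_parent_ns;
        c->to_parent_ns = 0;
    }
    g->total_ns += g->pending_ns;
    g->to_parent_ns += g->pending_ns;
    g->pending_ns = 0;
    return visited;
}

uint64_t group_read(struct group *g)
{
    group_flush(g);
    return g->total_ns;
}
```

In this sketch a group stays linked on its parent's list after being read
directly, so the amount parked in to_parent_ns is still picked up when an
ancestor is read later.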

> I have a couple questions about cpuacct tho.
>
> * The stat file is sampling based and the usage files are calculated
>   from actual scheduling events.  Is this because the latter is more
>   accurate?

So I actually don't know the history of this stuff too well. But I would
think so. This all looks rather dodgy.

> * Why do we have user/sys breakdown in usage numbers?  It tries to
>   distinguish user or sys by looking at task_pt_regs().  I can't see
>   how this would work (e.g. interrupt handlers never schedule) and w/o
>   kernel preemption, the sys part is always zero.  What is this number
>   supposed to mean?

For normal scheduler stuff we account the total runtime in ns and use
the user/kernel tick samples to divide it into user/kernel time parts.
See cputime_adjust().
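
A rough userspace sketch of that split, assuming only what is described
above: the precise runtime in ns is divided in proportion to the sampled
user and kernel ticks. split_runtime() and struct split_ns are hypothetical
names, not kernel API, and the sketch leaves out whatever cputime_adjust()
does beyond the proportional split. The 128-bit intermediate relies on the
GCC/Clang unsigned __int128 extension.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type; not a kernel structure. */
struct split_ns {
    uint64_t user_ns;
    uint64_t sys_ns;
};

/* Divide a precise runtime into user and kernel parts using the ratio of
 * tick samples. With no kernel samples, everything counts as user time;
 * with no user samples, everything counts as kernel time. */
struct split_ns split_runtime(uint64_t runtime_ns, uint64_t user_ticks,
                              uint64_t sys_ticks)
{
    struct split_ns out;

    if (sys_ticks == 0) {
        out.sys_ns = 0;
    } else if (user_ticks == 0) {
        out.sys_ns = runtime_ns;
    } else {
        /* Widen to 128 bits so neither the product nor the tick sum
         * can overflow. */
        unsigned __int128 total = (unsigned __int128)user_ticks + sys_ticks;
        unsigned __int128 scaled = (unsigned __int128)runtime_ns * sys_ticks;

        out.sys_ns = (uint64_t)(scaled / total);
    }
    out.user_ns = runtime_ns - out.sys_ns;

    /* The two parts always add up to the precise runtime. */
    assert(out.user_ns + out.sys_ns == runtime_ns);
    return out;
}
```

For example, 1000 ns of runtime with 3 user ticks and 1 kernel tick comes
out as 750 ns user and 250 ns kernel.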

But looking at the cpuacct I have no clue, that looks wonky at best.

Ideally we'd reuse the normal cputime code and do the same thing
per-cgroup, but clearly that isn't happening now.

I never really looked further than that cpuacct_charge() doing _another_
cgroup iteration, even though we already account that delta to each
cgroup (modulo scheduling class crud).
