Re: [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy

From: Tejun Heo <hidden>
Date: 2017-08-02 15:41:41
Also in: lkml

Hello, Peter.

On Tue, Aug 01, 2017 at 11:40:38PM +0200, Peter Zijlstra wrote:

quoted

* On cgroup2, there is only one hierarchy.  It'd be great to have
  basic resource accounting enabled by default on all cgroups.  Note
  that we couldn't do that on v1 because there could be any number of
  hierarchies and the cost would increase with the number of
  hierarchies.

Yes, the whole single hierarchy thing makes doing away with the double
accounting possible.

Yeah, we can either do that or make it cheaper so that we can have
basic stats by default.

quoted

* It is bothersome that we're walking up the tree each time for
  cpuacct although being percpu && just walking up the tree makes it
  relatively cheap.

So even if its only CPU local accounting, you still have all the pointer
chasing and misses, not to mention that a faster O(depth) is still
O(depth).

quoted

	Anyways, I'm thinking about shifting the
  aggregation to the reader side so that the hot path always only
  updates local counters in a way which can scale even when there are
  a lot of (idle) cgroups.  Will follow up on this later.

Not entirely sure I follow, we currently only update the current cgroup
and its immediate parents, no? Or are you looking to only account into
the current cgroup and propagate into the parents on reading?

Yeah, shifting the cost to the readers and being smart with
propagation so that reading isn't O(nr_descendants) but
O(nr_descendants_which_have_run_since_last_read).  That way, we can
show the basic stats without taxing the hot paths with reasonable
scalability.

I have a couple questions about cpuacct tho.

* The stat file is sampling based and the usage files are calculated
  from actual scheduling events.  Is this because the latter is more
  accurate?

* Why do we have user/sys breakdown in usage numbers?  It tries to
  distinguish user or sys by looking at task_pt_regs().  I can't see
  how this would work (e.g. interrupt handlers never schedule) and w/o
  kernel preemption, the sys part is always zero.  What is this number
  supposed to mean?

Thanks.

-- 
tejun

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help