Re: [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy
From: Tejun Heo <hidden>
Date: 2017-08-02 15:41:41
Hello, Peter.

On Tue, Aug 01, 2017 at 11:40:38PM +0200, Peter Zijlstra wrote:
> > * On cgroup2, there is only one hierarchy. It'd be great to have basic resource accounting enabled by default on all cgroups. Note that we couldn't do that on v1 because there could be any number of hierarchies and the cost would increase with the number of hierarchies.
>
> Yes, the whole single hierarchy thing makes doing away with the double accounting possible.

Yeah, we can either do that or make it cheaper so that we can have basic stats by default.

> > * It is bothersome that we're walking up the tree each time for cpuacct although being percpu && just walking up the tree makes it relatively cheap.
>
> So even if its only CPU local accounting, you still have all the pointer chasing and misses, not to mention that a faster O(depth) is still O(depth).
>
> > Anyways, I'm thinking about shifting the aggregation to the reader side so that the hot path always only updates local counters in a way which can scale even when there are a lot of (idle) cgroups. Will follow up on this later.
>
> Not entirely sure I follow, we currently only update the current cgroup and its immediate parents, no? Or are you looking to only account into the current cgroup and propagate into the parents on reading?

Yeah, shifting the cost to the readers and being smart with propagation so that reading isn't O(nr_descendants) but O(nr_descendants_which_have_run_since_last_read). That way, we can show the basic stats without taxing the hot paths with reasonable scalability.

I have a couple questions about cpuacct tho.

* The stat file is sampling based and the usage files are calculated from actual scheduling events. Is this because the latter is more accurate?

* Why do we have user/sys breakdown in usage numbers? It tries to distinguish user or sys by looking at task_pt_regs(). I can't see how this would work (e.g. interrupt handlers never schedule) and w/o kernel preemption, the sys part is always zero. What is this number supposed to mean?

Thanks.

--
tejun
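
For illustration, the reader-side aggregation idea above could look roughly like the following user-space C sketch. It is not the kernel's implementation, and everything in it is an assumption made for the example: struct cg, cg_init(), cg_charge(), cg_flush(), cg_read() and nodes_visited are hypothetical names, and the model is single-threaded with no per-CPU counters or locking. The hot path only bumps a local counter and, the first time a cgroup is charged since the last read, links it and its not-yet-linked ancestors onto "updated" lists. A read then visits only the cgroups found on those lists.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model of a cgroup node; not a kernel structure. */
struct cg {
    struct cg *parent;
    uint64_t pending;             /* local usage not yet folded into total */
    uint64_t total;               /* own usage plus flushed descendants */
    uint64_t propagated;          /* part of total already pushed to parent */
    bool on_list;                 /* marked as updated since last flush */
    struct cg *updated_children;  /* children updated since last flush */
    struct cg *updated_next;      /* link on parent's updated_children */
};

/* Counts nodes touched by flushes, to show that idle cgroups are skipped. */
static unsigned long nodes_visited;

static void cg_init(struct cg *cg, struct cg *parent)
{
    *cg = (struct cg){ .parent = parent };
}

/*
 * Hot path: update the local counter only.  The walk up the tree stops at
 * the first ancestor that is already marked, so repeated charges between
 * two reads do no tree walking at all.
 */
static void cg_charge(struct cg *cg, uint64_t delta)
{
    cg->pending += delta;
    for (; cg && !cg->on_list; cg = cg->parent) {
        cg->on_list = true;
        if (cg->parent) {
            cg->updated_next = cg->parent->updated_children;
            cg->parent->updated_children = cg;
        }
    }
}

/*
 * Reader side: fold in updated descendants first, then the local counter,
 * then push whatever the parent has not seen yet into the parent's total.
 * A node flushed directly by a reader stays on its parent's list; the
 * parent's next flush revisits it once and finds nothing to push.
 */
static void cg_flush(struct cg *cg)
{
    struct cg *child = cg->updated_children;

    cg->updated_children = NULL;
    nodes_visited++;
    while (child) {
        struct cg *next = child->updated_next;

        child->updated_next = NULL;
        child->on_list = false;
        cg_flush(child);
        child = next;
    }
    cg->total += cg->pending;
    cg->pending = 0;
    if (cg->parent) {
        cg->parent->total += cg->total - cg->propagated;
        cg->propagated = cg->total;
    }
}

static uint64_t cg_read(struct cg *cg)
{
    cg_flush(cg);
    return cg->total;
}
```

With a hundred idle siblings under the root, reading the root after charging a single leaf visits only the root and the chain down to that leaf, which is the O(nr_descendants_which_have_run_since_last_read) behavior. The cost moved off the hot path shows up as the recursive walk in cg_flush().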