Re: [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy
From: Peter Zijlstra <peterz@infradead.org>
Date: 2017-08-02 16:05:32
On Wed, Aug 02, 2017 at 08:41:35AM -0700, Tejun Heo wrote:
> > Not entirely sure I follow, we currently only update the current cgroup and its immediate parents, no? Or are you looking to only account into the current cgroup and propagate into the parents on reading?
>
> Yeah, shifting the cost to the readers and being smart with propagation so that reading isn't O(nr_descendants) but O(nr_descendants_which_have_run_since_last_read). That way, we can show the basic stats without taxing the hot paths with reasonable scalability.
Right, that would be good.
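For reference, the reader-side propagation discussed above could look roughly like the following user-space sketch. It is only an illustration under stated assumptions: struct toy_cgroup and the toy_* helpers are hypothetical names invented here, there is no locking or per-cpu handling, and nothing in it is taken from an actual patch. The hot path only bumps a local counter and marks the path to the root, stopping at the first ancestor that is already marked; a read flushes only the marked nodes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy node; not a kernel structure. */
struct toy_cgroup {
	struct toy_cgroup *parent;
	struct toy_cgroup *updated_children; /* children charged since their last flush */
	struct toy_cgroup *updated_next;     /* link on the parent's updated list */
	int on_updated_list;
	uint64_t pending_ns;                 /* charged but not yet folded into total_ns */
	uint64_t total_ns;                   /* self + descendants as of the last flush */
};

/* Counts nodes touched by flushes, only to make the read cost visible. */
static unsigned int toy_nodes_visited;

/* Hot path: account locally, then mark ancestors until one is already marked. */
static void toy_charge(struct toy_cgroup *cg, uint64_t delta_ns)
{
	struct toy_cgroup *n;

	assert(cg != NULL);
	cg->pending_ns += delta_ns;

	for (n = cg; n->parent && !n->on_updated_list; n = n->parent) {
		n->updated_next = n->parent->updated_children;
		n->parent->updated_children = n;
		n->on_updated_list = 1;
	}
}

/* Read path: flush marked children first, then fold and push the delta up. */
static void toy_flush(struct toy_cgroup *cg)
{
	struct toy_cgroup *child;

	toy_nodes_visited++;
	while ((child = cg->updated_children) != NULL) {
		cg->updated_children = child->updated_next;
		child->updated_next = NULL;
		child->on_updated_list = 0;
		toy_flush(child);
	}

	cg->total_ns += cg->pending_ns;
	if (cg->parent)
		cg->parent->pending_ns += cg->pending_ns;
	cg->pending_ns = 0;
}

/* Returns the hierarchical total for cg after flushing its subtree. */
static uint64_t toy_read(struct toy_cgroup *cg)
{
	toy_flush(cg);
	return cg->total_ns;
}
```

With a tree root -> a -> b and root -> c, charging b and c and then reading root walks all four nodes once; a later charge to c followed by another read of root only touches root and c.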
> I have a couple questions about cpuacct tho.
>
> * The stat file is sampling based and the usage files are calculated from actual scheduling events. Is this because the latter is more accurate?
So I actually don't know the history of this stuff too well. But I would think so. This all looks rather dodgy.
> * Why do we have user/sys breakdown in usage numbers? It tries to distinguish user or sys by looking at task_pt_regs(). I can't see how this would work (e.g. interrupt handlers never schedule) and w/o kernel preemption, the sys part is always zero. What is this number supposed to mean?
For normal scheduler stuff we account the total runtime in ns and use the user/kernel tick samples to divide it into user/kernel time parts. See cputime_adjust().

But looking at the cpuacct I have no clue, that looks wonky at best. Ideally we'd reuse the normal cputime code and do the same thing per-cgroup, but clearly that isn't happening now.

I never really looked further than that cpuacct_charge() doing _another_ cgroup iteration, even though we already account that delta to each cgroup (modulo scheduling class crud).
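To make the runtime split mentioned above concrete, here is a simplified user-space sketch of the idea: take the precise total runtime in ns and divide it in the ratio of the sampled user/kernel ticks. It is loosely modeled on the description of cputime_adjust() but is not the kernel code; struct toy_cputime and toy_split_runtime() are hypothetical names, the monotonicity handling the real function needs is left out, and the tick counts are assumed to stay below 2^31 so the intermediate product fits in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type; not a kernel structure. */
struct toy_cputime {
	uint64_t utime_ns;
	uint64_t stime_ns;
};

/*
 * Split rtime_ns into user and system parts in the ratio
 * utime_ticks : stime_ticks. The two parts always sum to rtime_ns.
 */
static struct toy_cputime toy_split_runtime(uint64_t rtime_ns,
					    uint64_t utime_ticks,
					    uint64_t stime_ticks)
{
	struct toy_cputime out;
	uint64_t total, q, r;

	/* No system samples (or no samples at all): everything is user time. */
	if (stime_ticks == 0) {
		out.utime_ns = rtime_ns;
		out.stime_ns = 0;
		return out;
	}

	/* No user samples: everything is system time. */
	if (utime_ticks == 0) {
		out.utime_ns = 0;
		out.stime_ns = rtime_ns;
		return out;
	}

	/* Assumption of this sketch: keeps r * stime_ticks below 2^64. */
	assert(utime_ticks <= UINT32_MAX / 2 && stime_ticks <= UINT32_MAX / 2);

	total = utime_ticks + stime_ticks;
	q = rtime_ns / total;
	r = rtime_ns % total;

	/* floor(rtime_ns * stime_ticks / total), computed without overflow */
	out.stime_ns = q * stime_ticks + (r * stime_ticks) / total;
	out.utime_ns = rtime_ns - out.stime_ns;
	return out;
}
```

For example, 1000 ns of runtime with 3 user ticks and 1 system tick comes out as 750 ns user and 250 ns system. Doing the same thing per-cgroup would mean keeping the ns total and the tick samples per cgroup and applying the split at read time.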