Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group... | linux-api

Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP

From: Tejun Heo <hidden>
Date: 2016-04-08 20:11:41
Also in: cgroups, lkml

Hello, Peter.

On Thu, Apr 07, 2016 at 10:25:42PM +0200, Peter Zijlstra wrote:

quoted

The balkanization was no coincidence either.  Tasks and cgroups are
different types of entities and don't have the same control knobs or
follow the same lifetime rules.  For absolute limits, it isn't clear
how much of the parent's resources should be distributed to internal
children as opposed to child cgroups.  People end up depending on
specific implementation details and proposing one-off hacks and
interface additions.

Yes, I'm familiar with the problem; but simply mandating leaf only nodes
is not a solution, for the very simple fact that there are tasks in the
root cgroup that cannot ever be moved out, so we _must_ be able to deal
with !leaf nodes containing tasks.

As Johannes already pointed out, the root cgroup has always been
special.  While pure practicality, performance implications and
implementation convenience do play important roles in the special
treatment, another contributing aspect is avoiding exposing
statistics and control knobs which are duplicates of and/or
conflicting with what's already available at the system level.  It's
never fun to have multiple sources of truth.

A consistent interface for absolute controllers to divvy up the
resources between local tasks and child cgroups isn't _that_ hard.

I've spent months thinking about it and didn't get too far.  If you
have a good solution, I'd be happy to be enlightened.  Also, please
note that the current solution is based on restricting certain
configurations.  If we can find a better solution, we can relax the
relevant constraints and move onto it without breaking compatibility.

And this leaf only business totally screwed over anything proportional.

This simply cannot work.

Will get to this below.

quoted

Proportional weights aren't much better either.  CPU has an internal
mapping between nice values and shares and treats them equally, which
can get confusing as the configured weights behave differently
depending on how many threads are in the parent cgroup which often is
opaque and can't be controlled from outside.

Huh what? There's nothing confusing there, the nice to weight mapping is
static and can easily be consulted. Alternatively we can make an
interface where you can set weight through nice values, for those people
that are afraid of numbers.

But the configured weights do _not_ behave differently depending on the
number of tasks, they behave exactly as specified in the proportional
weight based rate distribution. We've done the math..

Yes, once one understands what's going on, it isn't confusing.  It's
just not something users can intuitively understand from the presented
interface.  The confusion of course is worsened severely by different
controller behaviors.
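
To make the point about thread counts concrete, here is a minimal sketch of the arithmetic, not the scheduler's actual code. It assumes a single CPU, that every entity is always runnable, and that a nice-0 task and a child cgroup with default shares both carry a weight of 1024. The function name `child_cgroup_share` is hypothetical.

```python
# Simplified model: one CPU, every entity always runnable, no SMP effects.
NICE_0_WEIGHT = 1024  # weight of a nice-0 task; also the default cpu.shares


def child_cgroup_share(child_shares, parent_task_weights):
    """Fraction of the parent's CPU time that one child cgroup receives
    when it competes with tasks living directly in the parent.

    Hypothetical helper for illustration only.
    """
    total = child_shares + sum(parent_task_weights)
    return child_shares / total


# The child's configured weight never changes, yet its effective share
# shrinks as more threads appear in the parent.
for n_threads in (1, 3, 7):
    share = child_cgroup_share(1024, [NICE_0_WEIGHT] * n_threads)
    print(f"{n_threads} internal threads -> child cgroup gets {share:.3f}")
```

The distribution follows the configured weights exactly, as Peter says, but the share a given cgroup ends up with depends on how many threads its parent happens to contain, which is usually not visible from outside.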

quoted

Widely diverging from
CPU's behavior, IO grouped all internal tasks into an internal leaf
node and used to assign a fixed weight to it.

That's just plain broken... That is not how a proportional weight based
hierarchical controller works.

That's a strong statement.  When the hierarchy is composed of
equivalent objects as in CPU, not distinguishing internal and leaf
nodes would be a more natural way to organize; however, it isn't
necessarily true in all cases.  For example, while a writeback IO
would be issued by some task, the task itself might not have done
anything to cause that IO and the IO would essentially be anonymous in
the resource domain.  Also, different controllers use different units
of organization - CPU sees threads, IO sees IO contexts which are
usually shared in a process.  The difference would lead to differing
scaling behaviors in proportional distribution.

While the separate buckets and entities model may not be as elegant as
a tree of uniform objects, it is far from uncommon and more robust when
dealing with different types of objects.
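
For contrast, the internal-leaf model described for IO can be sketched the same way. This is only an illustration under assumptions: the weights are arbitrary example values, the function name `child_share_internal_leaf` is hypothetical, and all tasks in the parent are treated as a single implicit leaf with one fixed weight.

```python
def child_share_internal_leaf(child_weight, leaf_weight, n_internal_tasks):
    """Fraction of the parent's resource that a child cgroup receives when
    all of the parent's own tasks are grouped into one implicit leaf node.

    Hypothetical helper for illustration only. n_internal_tasks is
    deliberately unused: the implicit leaf has a fixed weight no matter
    how many tasks it holds.
    """
    return child_weight / (child_weight + leaf_weight)


# Arbitrary example weights; the child's share stays constant as the
# number of internal tasks grows.
for n_tasks in (1, 3, 7):
    share = child_share_internal_leaf(500, 500, n_tasks)
    print(f"{n_tasks} internal tasks -> child cgroup gets {share:.3f}")
```

Under this model the split between a child cgroup and the parent's own tasks is fixed by configuration alone, which is the behavioral difference from the CPU case discussed above.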

quoted

Now, you might think that none of it matters and each subsystem
treating cgroup hierarchy as arbitrary and orthogonal collections of
bean counters is fine; however, that makes it impossible to account
for and control operations which span different types of resources.
This prevented us from implementing resource control over frigging
buffered writes, making the whole IO control thing a joke.  While CPU
currently doesn't directly tie into it, that is only because CPU
cycles spent during writeback aren't yet properly accounted.

CPU cycles spent in waitqueues aren't properly accounted to whoever
queued the job either, and there's a metric ton of async stuff that's
not properly accounted, so what?

The ultimate goal of cgroup resource control is accounting and
controlling all significant resource consumptions as configured.  Some
system operations are inherently global and others are simply too
cheap to justify the overhead; however, there still are significant
aggregate operations which are being missed out including almost
everything taking place in the writeback path.  So, yes, we eventually
want to be able to account for them, of course in a way which doesn't
get in the way of actual operation.

quoted

However, please understand that there are a lot of use cases where
comprehensive and consistent resource accounting and control over all
major resources is useful and necessary.

Maybe, but so far I've only heard people complain this v2 thing didn't
work for them, and as far as I can see the whole v2 model is internally
inconsistent and impossible to implement.

I suppose we live in different bubbles.  Can you please elaborate
which parts of cgroup v2 model are internally inconsistent and
impossible to implement?  I'd be happy to rectify the situation.

The suggestion by Johannes to adjust the leaf node weight depending on
the number of tasks in it is so ludicrous I don't even know where to start
enumerating the fail.

That sounds like a pretty uncharitable way to read his message.  I
think he was trying to find out the underlying requirements so that a
way forward can be discussed.  I do have the same question.  It's
difficult to have discussions about trade-offs without knowing where
the requirements are coming from.  Do you have something in mind for
cases where internal tasks have to compete with sibling cgroups?

Thanks.

-- 
tejun
