Thread (50 messages), 6 authors, 2016-04-15

Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP

From: Peter Zijlstra <hidden>
Date: 2016-04-07 20:25:55
Also in: cgroups, lkml

On Thu, Apr 07, 2016 at 03:45:55PM -0400, Tejun Heo wrote:
Hello, Peter.

On Thu, Apr 07, 2016 at 10:08:33AM +0200, Peter Zijlstra wrote:
quoted
On Thu, Apr 07, 2016 at 03:35:47AM -0400, Johannes Weiner wrote:
quoted
So it was a nice cleanup for the memory controller and I believe the
IO controller as well. I'd be curious how it'd be a problem for CPU?
The full hierarchy took years to make work and is fully ingrained with
how the thing works, changing it isn't going to be nice or easy.

So sure, go with a lowest common denominator, instead of fixing shit,
yay for progress :/
It's easy to get fixated on what each subsystem can do and develop
towards different directions siloed in each subsystem.  That's what
we've had for quite a while in cgroup.  Expectedly, this sends off
controllers towards different directions.  Direct competition between
tasks and child cgroups was one of the main sources of balkanization.

The balkanization was no coincidence either.  Tasks and cgroups are
different types of entities and don't have the same control knobs or
follow the same lifetime rules.  For absolute limits, it isn't clear
how much of the parent's resources should be distributed to internal
children as opposed to child cgroups.  People end up depending on
specific implementation details and proposing one-off hacks and
interface additions.
Yes, I'm familiar with the problem; but simply mandating leaf only nodes
is not a solution, for the very simple fact that there are tasks in the
root cgroup that cannot ever be moved out, so we _must_ be able to deal
with !leaf nodes containing tasks.

A consistent interface for absolute controllers to divvy up the
resources between local tasks and child cgroups isn't _that_ hard.

And this leaf only business totally screwed over anything proportional.

This simply cannot work.
Proportional weights aren't much better either.  CPU has an internal
mapping between nice values and shares and treats them equally, which
can get confusing as the configured weights behave differently
depending on how many threads are in the parent cgroup which often is
opaque and can't be controlled from outside.
Huh what? There's nothing confusing there, the nice to weight mapping is
static and can easily be consulted. Alternatively we can make an
interface where you can set weight through nice values, for those people
that are afraid of numbers.

But the configured weights do _not_ behave differently depending on the
number of tasks, they behave exactly as specified in the proportional
weight based rate distribution. We've done the math..
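
For readers following the argument, here is a minimal sketch of the proportional weight rule being discussed, assuming tasks and child cgroups in one parent compete as equal entities. The table is reproduced from the kernel's static nice-to-weight array and should be checked against the scheduler source; the helper names `nice_to_weight` and `proportional_shares` are hypothetical, and the child group weight of 1024 is an assumed `cpu.shares` value.

```python
# Static nice-to-weight table; index 0 is nice -20, index 39 is nice +19.
# Values reproduced from the kernel scheduler source (verify before relying on them).
PRIO_TO_WEIGHT = [
    88761, 71755, 56483, 46273, 36291,  # nice -20 .. -16
    29154, 23254, 18705, 14949, 11916,  # nice -15 .. -11
    9548, 7620, 6100, 4904, 3906,       # nice -10 .. -6
    3121, 2501, 1991, 1586, 1277,       # nice  -5 .. -1
    1024, 820, 655, 526, 423,           # nice   0 .. 4
    335, 272, 215, 172, 137,            # nice   5 .. 9
    110, 87, 70, 56, 45,                # nice  10 .. 14
    36, 29, 23, 18, 15,                 # nice  15 .. 19
]


def nice_to_weight(nice):
    """Look up the scheduler weight for a nice value in [-20, 19]."""
    if not -20 <= nice <= 19:
        raise ValueError("nice value out of range: %d" % nice)
    return PRIO_TO_WEIGHT[nice + 20]


def proportional_shares(weights):
    """Return each runnable entity's fraction of the parent's CPU time.

    `weights` maps an entity name (task or child cgroup) to its weight;
    every entity is assumed runnable all the time.
    """
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


if __name__ == "__main__":
    # Two nice-0 tasks directly in the parent, plus one child cgroup
    # with an assumed weight of 1024.
    entities = {
        "task_a": nice_to_weight(0),
        "task_b": nice_to_weight(0),
        "child_group": 1024,
    }
    for name, fraction in proportional_shares(entities).items():
        print("%s: %.3f" % (name, fraction))
```

With these inputs each of the three entities receives one third of the parent's time. Adding two more nice-0 tasks to the parent would bring the child group's fraction down to one fifth, which is the behavior the two sides of this thread describe differently: the result follows directly from the weights, but it does shift with the number of sibling tasks.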
Widely diverging from
CPU's behavior, IO grouped all internal tasks into an internal leaf
node and used to assign a fixed weight to it.
That's just plain broken... That is not how a proportional weight based
hierarchical controller works.
Now, you might think that none of it matters and each subsystem
treating cgroup hierarchy as arbitrary and orthogonal collections of
bean counters is fine; however, that makes it impossible to account
for and control operations which span different types of resources.
This prevented us from implementing resource control over frigging
buffered writes, making the whole IO control thing a joke.  While CPU
currently doesn't directly tie into it, that is only because CPU
cycles spent during writeback aren't yet properly accounted.
CPU cycles spent in waitqueues aren't properly accounted to whoever
queued the job either, and there's a metric ton of async stuff that's
not properly accounted, so what?
However, please understand that there are a lot of use cases where
comprehensive and consistent resource accounting and control over all
major resources is useful and necessary.
Maybe, but so far I've only heard people complain this v2 thing didn't
work for them, and as far as I can see the whole v2 model is internally
inconsistent and impossible to implement.

The suggestion by Johannes to adjust the leaf node weight depending on
the number of tasks in it is so ludicrous I don't even know where to start
enumerating the fail.