Re: [Documentation] State of CPU controller in cgroup v2

From: Andy Lutomirski <luto@amacapital.net>
Date: 2016-08-20 18:46:19
Also in: linux-api, lkml

On Sat, Aug 20, 2016 at 8:56 AM, Tejun Heo [off-list ref] wrote:

Hello, Andy.

On Wed, Aug 17, 2016 at 01:18:24PM -0700, Andy Lutomirski wrote:

quoted

  2-1-1. Process Granularity

  For memory, because an address space is shared between all threads
  of a process, the terminal consumer is a process, not a thread.
  Separating the threads of a single process into different memory
  control domains doesn't make semantic sense.  cgroup v2 ensures
  that all controllers can agree on the same organization by requiring
  that threads of the same process belong to the same cgroup.
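
A minimal sketch of how one might check, from userspace, whether every
thread of a process currently reports the same cgroup membership.  It
assumes the per-thread files /proc/<pid>/task/<tid>/cgroup are
available; the helper names and the proc_root parameter are
hypothetical and exist only so the check can be pointed at a test
directory.

```python
import os


def thread_cgroups(pid, proc_root="/proc"):
    """Map each thread ID of `pid` to the text of its cgroup file."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    result = {}
    for tid in sorted(os.listdir(task_dir), key=int):
        with open(os.path.join(task_dir, tid, "cgroup")) as f:
            result[int(tid)] = f.read().strip()
    return result


def threads_share_cgroup(pid, proc_root="/proc"):
    """True if all threads of `pid` report identical cgroup membership."""
    memberships = set(thread_cgroups(pid, proc_root).values())
    return len(memberships) <= 1
```

On a hierarchy that enforces process granularity, the second helper
would be expected to return True for every process.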

I haven't followed all of the history here, but it seems to me that
this argument is less accurate than it appears.  Linux, for better or
for worse, has somewhat orthogonal concepts of thread groups
(processes), mms, and file tables.  An mm has VMAs in it, and VMAs can
reference things (files, etc) that hold resources.  (Two mms can share
resources by mapping the same thing or using fork().)  File tables
hold files, and files can use resources.  Both of these are, at best,
moderately good approximations of what actually holds resources.
Meanwhile, threads (tasks) do syscalls, take page faults, *allocate*
resources, etc.

So I think it's not really true to say that the "terminal consumer" of
anything is a process, not a thread.

The terminal consumer is actually the mm context.  A task may be the
allocating entity but not always for itself.

This becomes clear whenever an entity is allocating memory on behalf
of someone else - get_user_pages(), khugepaged, swapoff and so on (and
likely userfaultfd too).  When a task is trying to add a page to a
VMA, the task might not have any relationship with the VMA other than
that it's operating on it for someone else.  The page has to be
charged to whoever is responsible for the VMA and the only ownership
which can be established is the containing mm_struct.

This surprises me a bit.  If I do access_process_vm(), then I would
have expected the charge to go to the caller, not the mm being accessed.

What happens if a program calls read(2), though?  A page may be
inserted into page cache on behalf of an address_space without any
particular mm being involved.  There will usually be a calling task,
though.

But this is all very memcg-specific.  What about other cgroups?  I/O
is per-task, right?  Scheduling is definitely per-task.

While a mm_struct technically may not map to a process, it is a very
close approximation which is hardly ever broken in practice.

quoted

While it's certainly easier to think about assigning processes to
cgroups, and I certainly agree that, in the common case, it's the
right thing to do, I don't see why requiring it is a good idea.  Can
we turn this around: what actually goes wrong if cgroup v2 were to
allow assigning individual threads if a user specifically requests it?

Consider the scenario where you have somebody faulting on behalf of a
foreign VMA, but the thread who created and is actively using that VMA
is in a different cgroup than the process leader.  Who are we going to
charge?  All possible answers seem erratic.

Indeed, and this problem is probably not solvable in practice unless
you charge all involved cgroups.  But the caller's *mm* is entirely
irrelevant here, so I don't see how this implies that cgroups need to
keep tasks in the same process together.  The relevant entities are
the calling *task* and the target mm, and you're going to be
hard-pressed to ensure that they belong to the same cgroup, so I think
you need to be able to handle weird cases in which there isn't an
obviously correct cgroup to charge.

quoted

  there are other reasons to enforce process granularity.  One
  important one is isolating system-level management operations from
  in-process application operations.  The cgroup interface, being a
  virtual filesystem, is very unfit for multiple independent
  operations taking place at the same time as most operations have to
  be multi-step and there is no way to synchronize multiple accessors.
  See also [5] Documentation/cgroup-v2.txt, "R-2. Thread Granularity"

I don't buy this argument at all.  System-level code is likely to
assign single process *trees*, which are a different beast entirely.
I.e. you fork, move the child into a cgroup, and that child and its
children stay in that cgroup.  I don't see how the thread/process
distinction matters.
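
A rough sketch of the fork-then-assign pattern described above, under
the assumption that a process joins a cgroup by writing its PID to
that cgroup's cgroup.procs file.  The function name spawn_in_cgroup
and its arguments are hypothetical.  Here the child moves itself
before exec, so the final binary and anything it forks start out in
the target cgroup.

```python
import os


def spawn_in_cgroup(cgroup_dir, argv):
    """Fork, place the child in `cgroup_dir`, then exec `argv` in it.

    Returns the child's PID to the parent.
    """
    pid = os.fork()
    if pid == 0:
        # Child: join the cgroup first, then replace the process image.
        try:
            procs = os.path.join(cgroup_dir, "cgroup.procs")
            with open(procs, "w") as f:
                f.write(str(os.getpid()))
            os.execvp(argv[0], argv)
        finally:
            # Reached only if the write or the exec failed.
            os._exit(127)
    return pid
```

Nothing in this sketch depends on whether the kernel tracks membership
per thread or per process; the unit being placed is the whole child.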

Good point on the multi-process issue, this is something which nagged
me a bit while working on rgroup, although I have to point out that
the issue here is one of not going far enough rather than the approach
being wrong.  There are limitations to scoping it to individual
processes but that doesn't negate the underlying problem or the
usefulness of in-process control.

For system-level and process-level operations to not step on each
other's toes, they need to agree on the granularity boundary -
system-level should be able to treat an application hierarchy as a
single unit.  A possible solution is allowing rgroup hierarchies to
span across process boundaries and implementing cgroup migration
operations which treat such hierarchies as a single unit.  I'm not yet
sure whether the boundary should be at program groups or rgroups.

I think that, if the system cgroup manager is moving processes around
after starting them and execing the final binary, there will be races
and confusion, and no amount of granularity fiddling will fix that.

I know nothing about rgroups.  Are they upstream?

quoted

  2-1-2. No Internal Process Constraint

  cgroup v2 does not allow processes to belong to any cgroup which has
  child cgroups when resource controllers are enabled on it (the
  notable exception being the root cgroup itself).

Can you elaborate on this exception?  How do you get any of the
supposed benefits of not having processes and cgroups exist as
siblings when you make an exception for the root?  Similarly, if you
make an exception for the root, what do you do about cgroup namespaces
where the apparent root isn't the global root?

Having a special case doesn't necessarily get in the way of benefiting
from a set of general rules.  The root cgroup is inherently special as
it has to be the catch-all scope for entities and resource
consumptions which can't be tied to any specific consumer - irq
handling, packet rx, journal writes, memory reclaim from global memory
pressure and so on.  None of the sub-cgroups have to worry about them.

These base-system operations are special regardless of cgroup and we
already have sometimes crude ways to affect their behaviors where
necessary through sysctl knobs, priorities on specific kernel threads
and so on.  cgroup doesn't change the situation all that much.  What
gets left in the root cgroup usually are the base-system operations
which are outside the scope of cgroup resource control in the first
place and cgroup resource graph can treat the root as an opaque anchor
point.

This seems to explain why the controllers need to be able to handle
things being charged to the root cgroup (or to an unidentifiable
cgroup, anyway).  That isn't quite the same thing as allowing, from an
ABI point of view, the root cgroup to contain processes and cgroups
but not allowing other cgroups to do the same thing.  Consider:
suppose that systemd (or some competing cgroup manager) is designed to
run in the root cgroup namespace.  It presumably expects *itself* to
be in the root cgroup.  Now try to run it using cgroups v2 in a
non-root namespace.  I don't see how it can possibly work if the
hierarchy constraints don't permit it to create sub-cgroups while it's
still in the root.  In fact, this seems impossible to fix even with
user code changes.  The manager would need to simultaneously create a
new child cgroup to contain itself and assign itself to that child
cgroup, because the intermediate state is illegal.
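
The intermediate state can be illustrated with a toy model.  This is
only a sketch of the rule as worded in the quoted text (no processes
in a cgroup that has children and has controllers enabled, with the
global root exempt); it is not the kernel's implementation, and the
kernel's actual enforcement points may differ.  The Cgroup class and
violates_rule() are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Cgroup:
    name: str
    is_global_root: bool = False
    controllers_enabled: bool = False
    procs: set = field(default_factory=set)
    children: list = field(default_factory=list)


def violates_rule(cg):
    """Toy check: processes and child cgroups coexist under controllers."""
    if cg.is_global_root:
        return False  # the one exception named in the quoted text
    return bool(cg.procs) and bool(cg.children) and cg.controllers_enabled


# A manager running at the apparent root of a non-root namespace.
ns_root = Cgroup("ns-root", controllers_enabled=True, procs={"manager"})
before = violates_rule(ns_root)

# Step 1: create a child cgroup while the manager is still in ns_root.
child = Cgroup("manager-self")
ns_root.children.append(child)
intermediate = violates_rule(ns_root)

# Step 2: move the manager into the child.
ns_root.procs.discard("manager")
child.procs.add("manager")
after = violates_rule(ns_root)
```

In this model before and after are False while intermediate is True,
and the same two steps applied to a cgroup with is_global_root=True
never trip the check, which is the asymmetry in question.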

I really, really think that cgroup v2 should supply the same
*interface* inside and outside of a non-root namespace.  If this is
impossible due to ABI compatibility, then you could, in the worst
case, introduce cgroup v3, fix it there, and remove cgroup v2, since
apparently cgroup v2 isn't in use right now in mainline kernels.  (To
be clear, I think either decision -- allowing tasks and cgroups to be
siblings or disallowing it -- is okay, but I think that the interface
should apply the same constraint at all levels.)

--Andy
