Re: [Documentation] State of CPU controller in cgroup v2

From: Andy Lutomirski <hidden>
Date: 2016-09-16 16:29:38
Also in: cgroups, lkml

On Fri, Sep 16, 2016 at 9:19 AM, Peter Zijlstra [off-list ref] wrote:

On Fri, Sep 16, 2016 at 08:12:58AM -0700, Andy Lutomirski wrote:

On Sep 16, 2016 12:51 AM, "Peter Zijlstra" [off-list ref] wrote:

On Thu, Sep 15, 2016 at 01:08:07PM -0700, Andy Lutomirski wrote:

BTW, Mike keeps mentioning exclusive cgroups as problematic with the
no-internal-tasks constraints.  Do exclusive cgroups still exist in
cgroup2?  Could we perhaps just remove that capability entirely?  I've
never understood what problem exclusive cpusets and such solve that
can't be more comprehensibly solved by just assigning the cpusets the
normal inclusive way.

Without exclusive sets we cannot split the sched_domain structure.
Which leads to not being able to actually partition things. That would
break DL for one.

Can you sketch out a toy example?

[ Also see Documentation/cgroup-v1/cpusets.txt section 1.7 ]


  mkdir /cpuset

  mount -t cgroup -o cpuset none /cpuset

  mkdir /cpuset/A
  mkdir /cpuset/B

  cat /sys/devices/system/node/node0/cpulist > /cpuset/A/cpuset.cpus
  echo 0 > /cpuset/A/cpuset.mems

  cat /sys/devices/system/node/node1/cpulist > /cpuset/B/cpuset.cpus
  echo 1 > /cpuset/B/cpuset.mems

  # move all movable tasks into A (writes fail for tasks that cannot move)
  while read -r task; do echo "$task" > /cpuset/A/tasks; done < /cpuset/tasks

  # kill machine wide load-balancing
  echo 0 > /cpuset/cpuset.sched_load_balance

  # now place 'special' tasks in B


This partitions the scheduler into two, one for each node.

Hereafter no task will be moved from one node to another. The
load-balancer is split in two: one balances in A, one balances in B, and
nothing crosses. (It is important that A.cpus and B.cpus do not
intersect.)
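
The non-intersection requirement can be checked before the lists are written.
A minimal sketch, assuming plain "a-b,c" CPU lists (stride syntax is not
handled); the function names expand_cpulist and cpulists_disjoint are
hypothetical helpers, not part of any existing tool:

```shell
#!/bin/sh
# Expand a kernel-style CPU list such as "0-3,8" into one CPU per line.
expand_cpulist() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS=- read -r lo hi; do
        [ -n "$lo" ] || continue          # skip empty fields
        seq "$lo" "${hi:-$lo}"            # single CPU: hi defaults to lo
    done
}

# Succeed (exit status 0) only if the two CPU lists share no CPU.
cpulists_disjoint() {
    common=$(
        { expand_cpulist "$1" | sort -un; expand_cpulist "$2" | sort -un; } |
            sort -n | uniq -d             # a CPU printed twice is in both lists
    )
    [ -z "$common" ]
}
```

For the example above this could be used as:
cpulists_disjoint "$(cat /cpuset/A/cpuset.cpus)" "$(cat /cpuset/B/cpuset.cpus)" || echo "A and B overlap" >&2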

Ideally no task would remain in the root group. Back in the day we could
actually do this (with the exception of the kernel threads bound to a CPU),
but this has significantly regressed :-(
(still hate the workqueue affinity interface)

I wonder if we could address this by creating (automatically at boot
or when the cpuset controller is enabled or whatever) a
/cpuset/random_kernel_shit cgroup and have all of the unmoveable tasks
land there?

As is, tasks that are left in the root group get balanced within
whatever domain they ended up in.

And what's DL?

SCHED_DEADLINE, it's a 'Global'-EDF like scheduler that doesn't support
CPU affinities (because that doesn't make sense). The only way to
restrict it is to partition.

'Global' because you can partition it. If you reduce your system to
single CPU partitions you'll reduce to P-EDF.

(The same is true of SCHED_FIFO, that's a 'Global'-FIFO on the same
partition scheme, it however does support sched_affinity, but using it
gives 'interesting' schedulability results -- call it a historic
accident).

Hmm, I didn't realize that the deadline scheduler was global.  But
ISTM requiring the use of "exclusive" to get this working is
unfortunate.  What if a user wants two separate partitions, one using
CPUs 1 and 2 and the other using CPUs 3 and 4 (with 5 reserved for
non-RT stuff)?  Shouldn't we be able to have a cgroup for each of the
DL partitions and do something to tell the deadline scheduler "here is
your domain"?
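
For concreteness, here is roughly what that layout looks like when built with
the cgroup v1 recipe quoted earlier in this message. This is only a sketch
under assumptions: the cpuset hierarchy is mounted at /cpuset, every CPU sits
on memory node 0, and the group names rt0, rt1 and other are hypothetical.
CPU 0 is not mentioned in the example, so it is left out of every child set.

```shell
#!/bin/sh
# Assumed mount point of the v1 cpuset hierarchy; override via the environment.
CPUSET_ROOT=${CPUSET_ROOT:-/cpuset}

# Create one child cpuset.
# $1 = group name, $2 = CPU list, $3 = memory node list
make_partition() {
    mkdir -p "$CPUSET_ROOT/$1" &&
    echo "$2" > "$CPUSET_ROOT/$1/cpuset.cpus" &&
    echo "$3" > "$CPUSET_ROOT/$1/cpuset.mems"
}

# Two partitions for deadline tasks plus one CPU for everything else.
setup_partitions() {
    make_partition rt0   1-2 0 &&
    make_partition rt1   3-4 0 &&
    make_partition other 5   0 &&
    # As in the quoted recipe: turn off balancing at the root so the
    # non-overlapping children are balanced independently.
    echo 0 > "$CPUSET_ROOT/cpuset.sched_load_balance"
}
```

Calling setup_partitions as root would create the three groups; tasks would
then still have to be moved into them through each group's tasks file.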


On a related but different note, we have the isolcpus boot parameter,
which creates single-CPU partitions for all listed CPUs and gives the
rest to the root cpuset. Ideally we'd kill this option, given it's a
boot-time setting (for something which is trivial to do at runtime).

But this cannot be done, because that would mean we'd have to start with
a !0 cpuset layout:

                '/'
                load_balance=0
            /              \
        'system'        'isolated'
        cpus=~isolcpus  cpus=isolcpus
                        load_balance=0

And start with _everything_ in the /system group (including default IRQ
affinities).
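
The cpus=~isolcpus entry in the layout above is the complement of the
isolated list. A minimal sketch of computing it in shell, assuming plain
"a-b,c" lists (stride syntax is not handled); the function names expand_cpus
and cpus_minus are hypothetical and not part of any kernel tool:

```shell
#!/bin/sh
# Expand a kernel-style CPU list such as "0-3,8" into one CPU per line.
expand_cpus() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS=- read -r lo hi; do
        [ -n "$lo" ] || continue
        seq "$lo" "${hi:-$lo}"
    done
}

# Print the CPUs of list $1 that are not in list $2, comma-separated.
cpus_minus() {
    {
        expand_cpus "$2" | sed 's/^/- /'   # CPUs to drop, tagged "-"
        expand_cpus "$1" | sed 's/^/+ /'   # candidate CPUs, tagged "+"
    } | awk '
        $1 == "-" { drop[$2] = 1; next }
        !drop[$2] && !seen[$2]++ { out = out (out == "" ? "" : ",") $2 }
        END { print out }'
}

cpus_minus 0-7 2-3    # prints 0,1,4,5,6,7
```

With all CPUs as the first argument and the isolcpus list as the second, the
result is what would be written to cpuset.cpus of the 'system' group.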

Of course, that will break everything cgroup :-(

I would actually *much* prefer this over the status quo.  I'm tired of
my crappy, partially-working script that sits there and creates
exactly this configuration (minus the isolcpus part because I actually
want migration to work) on boot.  (Actually, it could have two
automatic cgroups: /kernel and /init -- init and UMH would go in init
and kernel threads and such would go in /kernel.  Userspace would be
able to request that a different cgroup be used for newly-created
kernel threads.)

Heck, even systemd would probably prefer this.  Then it could cleanly
expose a "slice" or whatever it's called for random kernel shit and at
least you could configure it meaningfully.
