Re: [RFC PATCH] cgroup: add cgroup.signal

From: Christian Brauner <hidden>
Date: 2021-04-26 15:29:32

On Mon, Apr 26, 2021 at 04:42:45PM +0200, Michal KoutnÃ½ wrote:

Hello Christian,
I have some questions to understand the motivation here.

On Fri, Apr 23, 2021 at 07:13:51PM +0200, Christian Brauner [off-list ref] wrote:

quoted

- Signals are specified by writing the signal number into cgroup.signal.
  An alternative would be to allow writing the signal name but I don't
  think that's worth it. Callers like systemd can easily do a snprintf()
  with the signal's define/enum.
- Since signaling is a one-time event and we're holding cgroup_mutex()
  as we do for freezer we don't need to worry about tasks joining the
  cgroup while we're signaling the cgroup. Freezer needed to care about
  this because a task could join or leave frozen/non-frozen cgroups.
  Since we only support SIGKILL currently and SIGKILL works for frozen
  tasks there's also not significant interaction with frozen cgroups.
- Since signaling leads to an event and not a state change the
  cgroup.signal file is write-only.

Have you considered accepting a cgroup fd to pidfd_send_signal and
realize this operation through this syscall? (Just asking as it may
prevent some of these consequences whereas bring other unclarities.)

That's semantically quite wrong on several fronts, I think.
pidfd_send_signal() operates on pidfds (and for a quirky historical
reason /proc/<pid> though that should die at some point). Making this
operate on cgroup fds is essentially implicit multiplexing which is
pretty nasty imho. In addition, this is a cgroup concept not a pidfd
concept.
I've also removed the "arbitrary signal" feature from the cgroup.signal
knob. I think Roman's right that it should simply be cgroup.kill.

quoted

- Since we currently only support SIGKILL we don't need to generate a
  separate notification and can rely on the unpopulated notification
  meachnism. If we support more signals we can introduce a separate
  notification in cgroup.events.

What kind of notification do you have in mind here?

This is now irrelevant with dumbing this down to cgroup.kill.

quoted

- Freezer doesn't care about tasks in different pid namespaces, i.e. if
  you have two tasks in different pid namespaces the cgroup would still
  be frozen.
  The cgroup.signal mechanism should consequently behave the same way,
  i.e.  signal all processes and ignore in which pid namespace they
  exist. This would obviously mean that if you e.g. had a task from an
  ancestor pid namespace join a delegated cgroup of a container in a
  child pid namespace the container can kill that task. But I think this
  is fine and actually the semantics we want since the cgroup has been
  delegated.

What do you mean by a delegated cgroup in this context?

Hm? I mean a cgroup that is delegated to a specific user according to
the official cgroup2 documentation.

brauner@wittgenstein|/sys/fs/cgroup/payload.f1

ls -al

total 0
drwxrwxr-x  6 root   100000 0 Apr 25 11:51 .
dr-xr-xr-x 41 root   root   0 Apr 25 11:51 ..
-r--r--r--  1 root   root   0 Apr 26 17:20 cgroup.controllers
-r--r--r--  1 root   root   0 Apr 26 17:20 cgroup.events
-rw-r--r--  1 root   root   0 Apr 26 17:20 cgroup.freeze
-rw-r--r--  1 root   root   0 Apr 26 17:20 cgroup.max.depth
-rw-r--r--  1 root   root   0 Apr 26 17:20 cgroup.max.descendants
-rw-rw-r--  1 root   100000 0 Apr 25 11:51 cgroup.procs
-r--r--r--  1 root   root   0 Apr 26 17:20 cgroup.stat
-rw-rw-r--  1 root   100000 0 Apr 25 11:51 cgroup.subtree_control
-rw-rw-r--  1 root   100000 0 Apr 25 11:51 cgroup.threads
-rw-r--r--  1 root   root   0 Apr 26 17:20 cgroup.type
-rw-r--r--  1 root   root   0 Apr 26 17:20 cpu.pressure
-r--r--r--  1 root   root   0 Apr 26 17:20 cpu.stat
drwxr-xr-x  2 100000 100000 0 Apr 25 11:51 init.scope
-rw-r--r--  1 root   root   0 Apr 26 17:20 io.pressure
drwxr-xr-x  2 100000 100000 0 Apr 25 11:51 .lxc
-rw-r--r--  1 root   root   0 Apr 26 17:20 memory.pressure
drwxr-xr-x 78 100000 100000 0 Apr 26 15:24 system.slice
drwxr-xr-x  2 100000 100000 0 Apr 25 11:52 user.slice

quoted

- We're holding the read-side of tasklist lock while we're signaling
  tasks. That seems fine since kill(-1, SIGKILL) holds the read-side
  of tasklist lock walking all processes and is a way for unprivileged
  users to trigger tasklist lock being held for a long time. In contrast
  it would require a delegated cgroup with lots of processes and a deep
  hierarchy to allow for something similar with this interface.

I'd better not proliferate tasklist_lock users if it's avoidable (such
as freezer does).

quoted

Fwiw, in addition to the system manager and container use-cases I think
this has the potential to be picked up by the "kill" tool. In the future
I'd hope we can do: kill -9 --cgroup /sys/fs/cgroup/delegated

(OT: FTR, there's `systemctl kill` already ;-))

Not every distro uses systemd. :)
And actually systemd is one of the users of this feature.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help