Re: [RFC PATCH] cgroup: add cgroup.signal
From: Christian Brauner <hidden>
Date: 2021-04-26 15:29:32
On Mon, Apr 26, 2021 at 04:42:45PM +0200, Michal Koutný wrote:
Hello Christian, I have some questions to understand the motivation here. On Fri, Apr 23, 2021 at 07:13:51PM +0200, Christian Brauner [off-list ref] wrote:quoted
- Signals are specified by writing the signal number into cgroup.signal. An alternative would be to allow writing the signal name but I don't think that's worth it. Callers like systemd can easily do a snprintf() with the signal's define/enum. - Since signaling is a one-time event and we're holding cgroup_mutex() as we do for freezer we don't need to worry about tasks joining the cgroup while we're signaling the cgroup. Freezer needed to care about this because a task could join or leave frozen/non-frozen cgroups. Since we only support SIGKILL currently and SIGKILL works for frozen tasks there's also not significant interaction with frozen cgroups. - Since signaling leads to an event and not a state change the cgroup.signal file is write-only.Have you considered accepting a cgroup fd to pidfd_send_signal and realize this operation through this syscall? (Just asking as it may prevent some of these consequences whereas bring other unclarities.)
That's semantically quite wrong on several fronts, I think. pidfd_send_signal() operates on pidfds (and for a quirky historical reason /proc/<pid> though that should die at some point). Making this operate on cgroup fds is essentially implicit multiplexing which is pretty nasty imho. In addition, this is a cgroup concept not a pidfd concept. I've also removed the "arbitrary signal" feature from the cgroup.signal knob. I think Roman's right that it should simply be cgroup.kill.
quoted
- Since we currently only support SIGKILL we don't need to generate a separate notification and can rely on the unpopulated notification meachnism. If we support more signals we can introduce a separate notification in cgroup.events.What kind of notification do you have in mind here?
This is now irrelevant with dumbing this down to cgroup.kill.
quoted
- Freezer doesn't care about tasks in different pid namespaces, i.e. if you have two tasks in different pid namespaces the cgroup would still be frozen. The cgroup.signal mechanism should consequently behave the same way, i.e. signal all processes and ignore in which pid namespace they exist. This would obviously mean that if you e.g. had a task from an ancestor pid namespace join a delegated cgroup of a container in a child pid namespace the container can kill that task. But I think this is fine and actually the semantics we want since the cgroup has been delegated.What do you mean by a delegated cgroup in this context?
Hm? I mean a cgroup that is delegated to a specific user according to the official cgroup2 documentation. brauner@wittgenstein|/sys/fs/cgroup/payload.f1
ls -al
total 0 drwxrwxr-x 6 root 100000 0 Apr 25 11:51 . dr-xr-xr-x 41 root root 0 Apr 25 11:51 .. -r--r--r-- 1 root root 0 Apr 26 17:20 cgroup.controllers -r--r--r-- 1 root root 0 Apr 26 17:20 cgroup.events -rw-r--r-- 1 root root 0 Apr 26 17:20 cgroup.freeze -rw-r--r-- 1 root root 0 Apr 26 17:20 cgroup.max.depth -rw-r--r-- 1 root root 0 Apr 26 17:20 cgroup.max.descendants -rw-rw-r-- 1 root 100000 0 Apr 25 11:51 cgroup.procs -r--r--r-- 1 root root 0 Apr 26 17:20 cgroup.stat -rw-rw-r-- 1 root 100000 0 Apr 25 11:51 cgroup.subtree_control -rw-rw-r-- 1 root 100000 0 Apr 25 11:51 cgroup.threads -rw-r--r-- 1 root root 0 Apr 26 17:20 cgroup.type -rw-r--r-- 1 root root 0 Apr 26 17:20 cpu.pressure -r--r--r-- 1 root root 0 Apr 26 17:20 cpu.stat drwxr-xr-x 2 100000 100000 0 Apr 25 11:51 init.scope -rw-r--r-- 1 root root 0 Apr 26 17:20 io.pressure drwxr-xr-x 2 100000 100000 0 Apr 25 11:51 .lxc -rw-r--r-- 1 root root 0 Apr 26 17:20 memory.pressure drwxr-xr-x 78 100000 100000 0 Apr 26 15:24 system.slice drwxr-xr-x 2 100000 100000 0 Apr 25 11:52 user.slice
quoted
- We're holding the read-side of tasklist lock while we're signaling tasks. That seems fine since kill(-1, SIGKILL) holds the read-side of tasklist lock walking all processes and is a way for unprivileged users to trigger tasklist lock being held for a long time. In contrast it would require a delegated cgroup with lots of processes and a deep hierarchy to allow for something similar with this interface.I'd better not proliferate tasklist_lock users if it's avoidable (such as freezer does).quoted
Fwiw, in addition to the system manager and container use-cases I think this has the potential to be picked up by the "kill" tool. In the future I'd hope we can do: kill -9 --cgroup /sys/fs/cgroup/delegated(OT: FTR, there's `systemctl kill` already ;-))
Not every distro uses systemd. :) And actually systemd is one of the users of this feature.