Re: [PATCH 7/8] cgroup: Add documentation for cgroup namespaces
From: Serge Hallyn <hidden>
Date: 2015-12-28 21:13:35
Also in:
cgroups, lkml
On Mon Dec 28 2015 09:47:35 AM PST, Tejun Heo [off-list ref] wrote:
Hello, I did some heavy editing of the documentation. How does this look?
Thanks Tejun, just three things (which come from my version):
quoted hunk ↗ jump to hunk
Did I miss anything? Thanks. --- Documentation/cgroup.txt | 146 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 146 insertions(+)--- a/Documentation/cgroup.txt +++ b/Documentation/cgroup.txt@@ -47,6 +47,11 @@ CONTENTS 5-3. IO 5-3-1. IO Interface Files 5-3-2. Writeback +6. Namespace + 6-1. Basics + 6-2. The Root and Views + 6-3. Migration and setns(2) + 6-4. Interaction with Other NamespacesP. Information on Kernel Programming P-1. Filesystem Support for Writeback D. Deprecated v1 Core Features@@ -1013,6 +1018,147 @@ writeback as follows.vm.dirty[_background]_ratio. +6. Namespace + +6-1. Basics + +cgroup namespace provides a mechanism to virtualize the view of the +"/proc/$PID/cgroup" file
and cgroup mounts
. The CLONE_NEWCGROUP clone flag can be used +with clone(2) and unshare(2) to create a new cgroup namespace. The +process running inside the cgroup namespace will have its +"/proc/$PID/cgroup" output restricted to cgroupns root. The cgroupns +root is the cgroup of the process at the time of creation of the +cgroup namespace. + +Without cgroup namespace, the "/proc/$PID/cgroup" file shows the +complete path of the cgroup of a process. In a container setup where +a set of cgroups and namespaces are intended to isolate processes the +"/proc/$PID/cgroup" file may leak potential system level information +to the isolated processes. For Example: + + # cat /proc/self/cgroup + 0::/batchjobs/container_id1 + +The path '/batchjobs/container_id1' can be considered as system-data +and undesirable to expose to the isolated processes. cgroup namespace +can be used to restrict visibility of this path. For example, before +creating a cgroup namespace, one would see: + + # ls -l /proc/self/ns/cgroup + lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835] + # cat /proc/self/cgroup + 0::/batchjobs/container_id1 + +After unsharing a new namespace, the view changes. + + # ls -l /proc/self/ns/cgroup + lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183] + # cat /proc/self/cgroup + 0::/ + +When some thread from a multi-threaded process unshares its cgroup +namespace, the new cgroupns gets applied to the entire process (all +the threads). This is natural for the v2 hierarchy; however, for the +legacy hierarchies, this may be unexpected. + +A cgroup namespace is alive as long as there are processes inside it.
Or mounts pinning it.
+When the last process exits
or the last mount is umounted,
, the cgroup namespace is destroyed. The
+cgroupns root and the actual cgroups remain.
+
+
+6-2. The Root and Views
+
+The 'cgroupns root' for a cgroup namespace is the cgroup in which the
+process calling unshare(2) is running. For example, if a process in
+/batchjobs/container_id1 cgroup calls unshare, cgroup
+/batchjobs/container_id1 becomes the cgroupns root. For the
+init_cgroup_ns, this is the real root ('/') cgroup.
+
+The cgroupns root cgroup does not change even if the namespace creator
+process later moves to a different cgroup.
+
+ # ~/unshare -c # unshare cgroupns in some cgroup
+ # cat /proc/self/cgroup
+ 0::/
+ # mkdir sub_cgrp_1
+ # echo 0 > sub_cgrp_1/cgroup.procs
+ # cat /proc/self/cgroup
+ 0::/sub_cgrp_1
+
+Each process gets its namespace-specific view of "/proc/$PID/cgroup"
+
+Processes running inside the cgroup namespace will be able to see
+cgroup paths (in /proc/self/cgroup) only inside their root cgroup.
+From within an unshared cgroupns:
+
+ # sleep 100000 &
+ [1] 7353
+ # echo 7353 > sub_cgrp_1/cgroup.procs
+ # cat /proc/7353/cgroup
+ 0::/sub_cgrp_1
+
+From the initial cgroup namespace, the real cgroup path will be
+visible:
+
+ $ cat /proc/7353/cgroup
+ 0::/batchjobs/container_id1/sub_cgrp_1
+
+From a sibling cgroup namespace (that is, a namespace rooted at a
+different cgroup), the cgroup path relative to its own cgroup
+namespace root will be shown. For instance, if PID 7353's cgroup
+namespace root is at '/batchjobs/container_id2', then it will see
+
+ # cat /proc/7353/cgroup
+ 0::/../container_id2/sub_cgrp_1
+
+Note that the relative path always starts with '/' to indicate that
+its relative to the cgroup namespace root of the caller.
+
+
+6-3. Migration and setns(2)
+
+Processes inside a cgroup namespace can move into and out of the
+namespace root if they have proper access to external cgroupsthis really means two things - write DAC access to the cgroupfs files, and access to the directories through a cgroupfs mount. Not sure if that should be spelled out.
. For +example, from inside a namespace with cgroupns root at +/batchjobs/container_id1, and assuming that the global hierarchy is +still accessible inside cgroupns: + + # cat /proc/7353/cgroup + 0::/sub_cgrp_1 + # echo 7353 > batchjobs/container_id2/cgroup.procs + # cat /proc/7353/cgroup + 0::/../container_id2 + +Note that this kind of setup is not encouraged. A task inside cgroup +namespace should only be exposed to its own cgroupns hierarchy. + +setns(2) to another cgroup namespace is allowed when: + +(a) the process has CAP_SYS_ADMIN against its current user namespace +(b) the process has CAP_SYS_ADMIN against the target cgroup + namespace's userns + +No implicit cgroup changes happen with attaching to another cgroup +namespace. It is expected that the someone moves the attaching +process under the target cgroup namespace root. + + +6-4. Interaction with Other Namespaces + +Namespace specific cgroup hierarchy can be mounted by a process +running inside a non-init cgroup namespace. + + # mount -t cgroup2 none $MOUNT_POINT + +This will mount the unified cgroup hierarchy with cgroupns root as the +filesystem root. The process needs CAP_SYS_ADMIN against its user and +mount namespaces. + +The virtualization of /proc/self/cgroup file combined with restricting +the view of cgroup hierarchy by namespace-private cgroupfs mount +provides a properly isolated cgroup view inside the container. + + P. Information on Kernel Programming This section contains kernel programming information in the areas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/