Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

From: Aditya Kali <hidden>
Date: 2014-10-22 00:46:32
Also in: cgroups, lkml

On Tue, Oct 21, 2014 at 3:42 PM, Andy Lutomirski [off-list ref] wrote:

On Tue, Oct 21, 2014 at 3:33 PM, Aditya Kali [off-list ref] wrote:

quoted

On Tue, Oct 21, 2014 at 12:02 PM, Andy Lutomirski [off-list ref] wrote:

quoted

On Tue, Oct 21, 2014 at 11:49 AM, Aditya Kali [off-list ref] wrote:

quoted

On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski [off-list ref] wrote:

quoted

On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
[off-list ref] wrote:

quoted

I do wonder if we think of this as chcgrouproot if there is a simpler
implementation.

Could be.  I'll defer to Aditya for that one.

More than chcgrouproot, its probably closer to pivot_cgroup_root. In
addition to restricting the process to a cgroup-root, new processes
entering the container should also be implicitly contained within the
cgroup-root of that container.

Why?  Concretely, why should this be in the kernel namespace code
instead of in userspace?

Userspace can do it too. Though then there will be possibility of
having processes in the same mount namespace with different
cgroup-roots. Deriving contents of /proc/<pid>/cgroup becomes even
more complex. Thats another reason why it might not be good idea to
tie cgroups with mount namespace.

quoted

Implementing pivot_cgroup_root would
probably involve overloading mount-namespace to now understand cgroup
filesystem too. I did attempt combining cgroupns-root with mntns
earlier (not via a new syscall though), but came to the conclusion
that its just simpler to have a separate cgroup namespace and get
clear semantics. One of the issues was that implicitly changing cgroup
on setns to mntns seemed like a huge undesirable side-effect.

About pinning: I really feel that it should be OK to pin processes
within cgroupns-root. I think thats one of the most important feature
of cgroup-namespace since its most common usecase is to containerize
un-trusted processes - processes that, for their entire lifetime, need
to remain inside their container.

So don't let them out.  None of the other namespaces have this kind of
constraint:

 - If you're in a mntns, you can still use fds from outside.
 - If you're in a netns, you can still use sockets from outside the namespace.
 - If you're in an ipcns, you can still use ipc handles from outside.

But none of the namespaces allow you to allocate new fds/sockets/ipc
handles in the outside namespace. I think moving a process outside of
cgroupns-root is like allocating a resource outside of your namespace.

In a pidns, you can see outside tasks if you have an outside procfs
mounted, but, if you don't, then you can't.  Wouldn't cgroupns be just
like that?  You wouldn't be able to escape your cgroup as long as you
don't have an inappropriate cgroupfs mounted.

I am not if we should only depend on restricted visibility for this
though. More details below.

quoted

And with explicit permission from
cgroup subsystem (something like cgroup.may_unshare as you had
suggested previously), we can make sure that unprivileged processes
cannot pin themselves. Also, maintaining this invariant (your current
cgroup is always under your cgroupns-root) keeps the code and the
semantics simple.

I actually think it makes the semantics more complex.  The less policy
you stick in the kernel, the easier it is to understand the impact of
that policy.

My inclination is towards keeping things simpler - both in code as
well as in configuration. I agree that cgroupns might seem
"less-flexible", but in its current form, it encourages consistent
container configuration. If you have a process that needs to move
around between cgroups belonging to different containers, then that
process should probably not be inside any container's cgroup
namespace. Allowing that will just make the cgroup namespace
pretty-much meaningless.

The problem with pinning is that preventing it causes problems
(specifically, either something potentially complex and incompatible
needs to be added or unprivileged processes will be able to pin
themselves).

Unless I'm missing something, a normal cgroupns user doesn't actually
need kernel pinning support to effectively constrain its members'
cgroups.

So there are 2 scenarios to consider:

We have 2 containers with cgroups: /container1 and /container2
Assume process P is running under cgroupns-root '/container1'

(1) process P wants to 'write' to cgroup.procs outside its
cgroupns-root (say to /container2/cgroup.procs)
(2) An admin process running in init_cgroup_ns (or any parent cgroupns
with cgroupns-root above /container1) wants to write pid of process P
to /container2/cgroup.procs (which lies outside of P's cgroupns-root)

For (1), I think its ok to reject such a write. This is consistent
with the restriction in cgroup_file_write added in 'Patch 6' of this
set. I believe this should be independent of visibility of the cgroup
hierarchy for P.

For (2), we may allow the write to succeed if we make sure that the
process doing the write is an admin process (with CAP_SYS_ADMIN in its
userns AND over P's cgroupns->user_ns).
If this write succeeds, then:
(a) process P's /proc/<pid>/cgroup does not show anything when viewed
by 'self' or any other process in P's cgrgroupns. I would really like
to avoid showing relative paths or paths outside the cgroupns-root
(b) if process P does 'mount -t cgroup cgroup <mnt>', it will still be
only able to mount and see cgroup hierarchy under its cgroupns-root
(d) if process P tries to write to any cgroup file outside of its
cgroupns-root (assuming that hierarchy is visible to it for whatever
reason), it will fail as in (1)

i.e., in summary, you can't escape out of cgroupns-root yourself. You
will need help from an admin process running under some parent
cgroupns-root to move you out. Is that workable for your usecase? Most
of the things above already happen with the current patch-set, so it
should be easy to enable this.

Though there are still some open issues like:
* what happens if you move all the processes out of /container1 and
then 'rmdir /container1'? As it is now, you won't be able to setns()
to that cgroupns anymore. But the cgroupns will still hang around
until the processes switch their cgroupns.
* should we then also allow setns() without first entering the
cgroupns-root? setns also checks the same conditions as in (a) plus it
checks that your current cgroup is descendant of target cgroupns-root.
Alternatively we can special-case setns() to own cgroupns so that it
doesn't fail.
* migration for these processes will be tricky, if not impossible. But
the admin trying to do this probably doesn't care about it or will
provision for it.

quoted

If we ditch the pinning requirement and allow the containarized
process to move outside of its cgroupns-root, we will have to address
atleast the following:
* what does its /proc/self/cgroup  (and /proc/<pid>/cgroup in general)
look like? We might need to just not show anything in
/proc/<pid>/cgroup in such case (for default hierarchy).

The process should see the cgroup path relative to its cgroup ns.
Whether this requires a new /proc mount or happens automatically is an
open question.  (I *hate* procfs for reasons like this.)

quoted

* how should future setns() and unshare() by such process behave?

Open question.

quoted

* 'mount -t cgroup cgroup <mnt>' by such a process will yield unexpected result

You could disallow that and instead require 'mount -t cgroup -o
cgrouproot=. cgroup mnt' where '.' will be resolved at mount time
relative to the caller's cgroupns.

quoted

* container will not remain migratable

Why not?

Well, the processes running outside of cgroupns root will be exposed
to information outside of the container (i.e., its /proc/self/cgroup
will show paths involving other containers and potentially system
level information). So unless you even restore them, it will be
difficult to restore these processes. The whole point of virtualizing
the /proc/self/cgroup view was so that the processes don't see outside
cgroups.

So don't do that?

Lot of non-cgroup-manager userspace processes have legitimate reasons
to read /proc/self/cgroup. One example is to register for OOM
notifications. Migratability of the container is also very important.
So "don't do that" is not always an option :)

--Andy

-- 
Aditya

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help