On Mon, 2020-02-17 at 16:57 -0500, Stéphane Graber wrote:
On Mon, Feb 17, 2020 at 4:12 PM James Bottomley <
James.Bottomley@hansenpartnership.com> wrote:
On Fri, 2020-02-14 at 19:35 +0100, Christian Brauner wrote:
[...]
With this patch series we simply introduce the ability to create
fsid mappings that are different from the id mappings of a user
namespace. The whole feature set is placed under a config option
that defaults to false.
In the usual case of running an unprivileged container we will
have set up an id mapping, e.g. 0 100000 100000. The on-disk
mapping will correspond to this id mapping, i.e. all files which
we want to appear as 0:0 inside the user namespace will be
chowned to 100000:100000 on the host. This works, because
whenever the kernel needs to do a filesystem access it will
look up the corresponding uid and gid in the idmapping tables of
the container.
Now think about the case where we want to have an id mapping of 0
100000 100000 but an on-disk mapping of 0 300000 100000 which is
needed to e.g. share a single on-disk mapping with multiple
containers that all have different id mappings.
This will be problematic. Whenever a filesystem access is
requested, the kernel will now try to look up a mapping for 300000
in the id mapping tables of the user namespace but since there is
none the files will appear to be owned by the overflow id, i.e.
usually 65534:65534 or nobody:nogroup.
With fsid mappings we can solve this by writing an id mapping of
0 100000 100000 and an fsid mapping of 0 300000 100000. On
filesystem access the kernel will now look up the mapping for
300000 in the fsid mapping tables of the user namespace. And
since such a mapping exists, the corresponding files will have
correct ownership.
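
The lookup described above can be modelled in a few lines of Python. This is only an illustrative sketch, not kernel code: `Extent`, `map_to_ns` and `OVERFLOW_ID` are hypothetical names, and the (ns_first, host_first, count) layout simply mirrors the three-column "0 100000 100000" notation used in this mail.

```python
from typing import NamedTuple, Sequence

OVERFLOW_ID = 65534  # usual overflow uid/gid (nobody/nogroup)


class Extent(NamedTuple):
    ns_first: int    # first id as seen inside the user namespace
    host_first: int  # first id on the host, i.e. what is stored on disk
    count: int       # number of ids covered by this extent


def map_to_ns(host_id: int, extents: Sequence[Extent]) -> int:
    """Translate an on-disk id into the id seen inside the namespace."""
    for ext in extents:
        if ext.host_first <= host_id < ext.host_first + ext.count:
            return ext.ns_first + (host_id - ext.host_first)
    # No extent covers this id, so the file appears owned by the overflow id.
    return OVERFLOW_ID


id_map = [Extent(0, 100000, 100000)]    # id mapping   0 100000 100000
fsid_map = [Extent(0, 300000, 100000)]  # fsid mapping 0 300000 100000

on_disk_owner = 300000  # file chowned to 300000 on the host

print(map_to_ns(on_disk_owner, id_map))    # 65534: looked up in the id mapping only
print(map_to_ns(on_disk_owner, fsid_map))  # 0: looked up in the fsid mapping
```

With only the id mapping, 300000 falls outside the 100000..199999 range and resolves to the overflow id; with the fsid mapping consulted for filesystem access, the same file resolves to 0 inside the namespace.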
How do we parametrise this new fsid shift for the unprivileged use
case? For newuidmap/newgidmap, it's easy because each user gets a
dedicated range and everything "just works (tm)". However, for the
fsid mapping, assuming some newfsuid/newfsgid tool to help, that
tool has to know not only your allocated uid/gid chunk, but also
the offset map of the image. The former is easy, but the latter is
going to vary by the actual image ... well unless we standardise
some accepted shift for images and it simply becomes a known static
offset.
For unprivileged runtimes, I would expect images to be unshifted and
be unpacked from within a userns.
For images whose resting format is an archive like tar, I concur.
So your unprivileged user would be allowed a uid/gid range through
/etc/subuid and /etc/subgid and allowed to use them through
newuidmap/newgidmap. In that namespace, you can then pull
and unpack any images/layers you may want and the resulting fs tree
will look correct from within that namespace.
All that is possible today and is how for example unprivileged LXC
works right now.
I do have a counter example, but it might be more esoteric: I do use
unprivileged architecture emulation containers to maintain actual
physical system boot environments. These are stored as mountable disk
images, not as archives, so I do need a simple remapping ... however, I
think this use case is simple: it's a back shift along my owned uid/gid
range, so tools for allowing unprivileged use can easily cope with this
use case, so the use is either fsid identity or fsid back along
existing user_ns mapping.
What this patchset then allows is for containers to have differing
uid/gid maps while still being based off the same image or layers.
In this scenario, you would carve a subset of your main uid/gid map
for each container you run and run them in a child user namespace
while setting up a fsuid/fsgid map such that their filesystem accesses
do not follow their uid/gid map. This then results in proper
isolation for processes, networks, ... as everything runs as
different kuid/kgid but the VFS view will be the same in all
containers.
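
As a rough illustration of that scenario, the sketch below models two containers with different uid maps but an identical fsuid map. All numbers and names here (`Extent`, `ns_to_host`, `host_to_ns`, the 100000/200000/300000 bases) are assumptions chosen for the example, not values taken from the patchset.

```python
from typing import NamedTuple, Optional

OVERFLOW_ID = 65534  # usual overflow uid/gid


class Extent(NamedTuple):
    ns_first: int    # first id inside the user namespace
    host_first: int  # first id outside of it
    count: int       # number of ids in the range


def ns_to_host(ns_id: int, ext: Extent) -> Optional[int]:
    """Map an id inside the namespace to the id used outside of it."""
    if ext.ns_first <= ns_id < ext.ns_first + ext.count:
        return ext.host_first + (ns_id - ext.ns_first)
    return None  # id is not mapped in this namespace


def host_to_ns(host_id: int, ext: Extent) -> int:
    """Map an outside (on-disk) id to the id seen inside the namespace."""
    if ext.host_first <= host_id < ext.host_first + ext.count:
        return ext.ns_first + (host_id - ext.host_first)
    return OVERFLOW_ID


# Hypothetical setup: distinct uid maps, one shared fsuid map.
containers = {
    "a": {"id_map": Extent(0, 100000, 65536), "fsid_map": Extent(0, 300000, 65536)},
    "b": {"id_map": Extent(0, 200000, 65536), "fsid_map": Extent(0, 300000, 65536)},
}

FILE_OWNER_ON_DISK = 300000  # owner of a file in the shared image

for name, maps in containers.items():
    root_kuid = ns_to_host(0, maps["id_map"])                     # process identity
    seen_owner = host_to_ns(FILE_OWNER_ON_DISK, maps["fsid_map"])  # VFS view
    print(name, root_kuid, seen_owner)
```

Root in each container ends up with a different outside id (100000 versus 200000), which gives the process-level isolation, while the file from the shared image shows up as owned by 0 in both.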
Who owns the shifted range of the image ... all tenants or none?
Shared storage between those otherwise isolated containers would also
work just fine by simply bind-mounting the same path into two or more
containers.
Now one additional thing that would be safe for a setuid wrapper to
allow would be for arbitrary mapping of any of the uid/gid that the
user owns to be used within the fsuid/fsgid map. One potential use
for this would be to create any number of user namespaces, each with
their own mapping for uid 0 while still having all VFS access be
mapped to the user that spawned them (say uid=1000, gid=1000).
Note that in our case, the intended use for this is from a privileged
runtime where our images would be unshifted as would be the container
storage and any shared storage for containers. The security model
effectively relying on properly configured filesystem permissions and
mount namespaces such that the content of those paths can never be
seen by anyone but root outside of those containers (and therefore
avoids all the issues around setuid/setgid/fscaps).
Yes, I understand ... all orchestration systems are currently hugely
privileged. However, there is interest in getting them down to only
"slightly privileged".
James
We will then be able to allocate distinct, random, ranges of 65536
uids/gids (or more) for each container without ever having to do any
uid/gid shifting at the filesystem layer or run into issues when
having to set up shared storage between containers or attaching
external storage volumes to those containers.