Re: [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2()

From: Andy Lutomirski <luto@amacapital.net>
Date: 2024-05-06 17:30:04
Also in: linux-fsdevel, lkml

Replying to a couple emails at once...

On Mon, May 6, 2024 at 12:14 AM Aleksa Sarai [off-list ref] wrote:

On 2024-04-28, Andy Lutomirski [off-list ref] wrote:

quoted

On Apr 26, 2024, at 6:39 AM, Stas Sergeev [off-list ref] wrote:
This patch-set implements the OA2_CRED_INHERIT flag for openat2() syscall.
It is needed to perform an open operation with the creds that were in
effect when the dir_fd was opened, if the dir was opened with O_CRED_ALLOW
flag. This allows the process to pre-open some dirs and switch eUID
(and other UIDs/GIDs) to the less-privileged user, while still retaining
the possibility to open/create files within the pre-opened directory set.

I’ve been contemplating this, and I want to propose a different solution.

First, the problem Stas is solving is quite narrow and doesn’t
actually need kernel support: if I want to write a user program that
sandboxes itself, I have at least three solutions already.  I can make
a userns and a mountns; I can use landlock; and I can have a separate
process that brokers filesystem access using SCM_RIGHTS.

But what if I want to run a container, where the container can access
a specific host directory, and the contained application is not aware
of the exact technology being used?  I recently started using
containers in anger in a production setting, and “anger” was
definitely the right word: binding part of a filesystem in is
*miserable*.  Getting the DAC rules right is nasty.  LSMs are worse.
Podman’s “bind,relabel” feature is IMO utterly disgusting.  I think I
actually gave up on making one of my use cases work on a Fedora
system.

Here’s what I wanted to do, logically, in production: pick a host
directory, pick a host *principal* (UID, GID, label, etc), and have
the *entire container* access the directory as that principal. This is
what happens automatically if I run the whole container as a userns
with only a single UID mapped, but I don’t really want to do that for
a whole variety and of reasons.

So maybe reimagining Stas’ feature a bit can actually solve this
problem.  Instead of a special dirfd, what if there was a special
subtree (in the sense of open_tree) that captures a set of creds and
does all opens inside the subtree using those creds?

This isn’t a fully formed proposal, but I *think* it should be
generally fairly safe for even an unprivileged user to clone a subtree
with a specific flag set to do this. Maybe a capability would be
needed (CAP_CAPTURE_CREDS?), but it would be nice to allow delegating
this to a daemon if a privilege is needed, and getting the API right
might be a bit tricky.

Tying this to an actual mount rather than a file handle sounds like a
more plausible proposal than OA2_CRED_INHERIT, but it just seems that
this is going to re-create all of the work that went into id-mapped
mounts but with the extra-special step of making the generic VFS
permissions no longer work normally (unless the idea is that everything
would pretend to be owned by current_fsuid()?).

I was assuming that the owner uid and gid would be show to stat, etc
as usual.  But the permission checks would be done against the
captured creds.

IMHO it also isn't enough to just make open work, you need to make all
operations work (which leads to a non-trivial amount of
filesystem-specific handling), which is just idmapped mounts. A lot of
work was put into making sure that is safe, and collapsing owners seems
like it will cause a lot of headaches.

I also find it somewhat amusing that this proposal is to basically give
up on multi-user permissions for this one directory tree because it's
too annoying to deal with. In that case, isn't chmod 777 a simpler
solution? (I'm being a bit flippant, of course there is a difference,
but the net result is that all users in the container would have the
same permissions with all of the fun issues that implies.)

In short, AFAICS idmapped mounts pretty much solve this problem (minus
the ability to collapse users, which I suspect is not a good idea in
general)?

With my kernel hat on, maybe I agree.  But with my *user* hat on, I
think I pretty strongly disagree.  Look, idmapis lousy for
unprivileged use:

$ install -m 0700 -d test_directory
$ echo 'hi there' >test_directory/file
$ podman run -it --rm
--mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim]
# cat /tmp/file
hi there

<-- Hey, look, this kind of works!

# setpriv --reuid=1 ls /tmp
ls: cannot open directory '/tmp': Permission denied

<-- Gee, thanks, Linux!


Obviously this is a made up example.  But it's quite analogous to a
real example.  Suppose I want to make a directory that will contain
some MySQL data.  I don't want to share this directory with anyone
else, so I set its mode to 0700.  Then I want to fire up an
unprivileged MySQL container, so I build or download it, and then I
run it and bind my directory to /var/lib/mysql and I run it.  I don't
need to think about UIDs or anything because it's 2024 and containers
just work.  Okay, I need to setenforce 0 because I'm on Fedora and
SELinux makes absolutely no sense in a container world, but I can live
with that.

Except that it doesn't work!  Because unless I want to manually futz
with the idmaps to get mysql to have access to the directory inside
the container, only *root* gets to get in.  But I bet that even
futzing with the idmap doesn't work, because software like mysql often
expects that root *and* a user can access data.  And some software
even does privilege separation and uses more than one UID.

So I want a way to give *an entire container* access to a directory.
Classic UNIX DAC is just *wrong* for this use case.  Maybe idmaps
could learn a way to squash multiple ids down to one.  Or maybe
something like my silly credential-capturing mount proposal could
work.  But the status quo is not actually amazing IMO.

I haven't looked at the idmap implementation nearly enough to have any
opinion as to whether squashing UID is practical or whether there's
any sensible way to specify it in the configuration.

On Apr 29, 2024, at 2:12 AM, Christian Brauner [off-list ref] wrote:

Nowadays it's extremely simple due tue open_tree(OPEN_TREE_CLONE) and
move_mount(). I rewrote the bind-mount logic in systemd based on that
and util-linux uses that as well now.
https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html

Yep, I remember that.

quoted

Podman’s “bind,relabel” feature is IMO utterly disgusting.  I think I
actually gave up on making one of my use cases work on a Fedora
system.

Here’s what I wanted to do, logically, in production: pick a host
directory, pick a host *principal* (UID, GID, label, etc), and have
the *entire container* access the directory as that principal. This is
what happens automatically if I run the whole container as a userns
with only a single UID mapped, but I don’t really want to do that for
a whole variety and of reasons.

You're describing idmapped mounts for the most part which are upstream
and are used in exactly that way by a lot of userspace.

See above...

quoted

So maybe reimagining Stas’ feature a bit can actually solve this
problem.  Instead of a special dirfd, what if there was a special
subtree (in the sense of open_tree) that captures a set of creds and
does all opens inside the subtree using those creds?

That would mean override creds in the VFS layer when accessing a
specific subtree which is a terrible idea imho. Not just because it will
quickly become a potential dos when you do that with a lot of subtrees
it will also have complex interactions with overlayfs.

I was deliberately talking about semantics, not implementation. This
may well be impossible to implement straightforwardly.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help