Re: [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2()
From: Andy Lutomirski <luto@amacapital.net>
Date: 2024-05-06 17:30:04
Also in: linux-fsdevel, lkml
Replying to a couple emails at once...

On Mon, May 6, 2024 at 12:14 AM Aleksa Sarai [off-list ref] wrote:
On 2024-04-28, Andy Lutomirski [off-list ref] wrote:
On Apr 26, 2024, at 6:39 AM, Stas Sergeev [off-list ref] wrote:
This patch-set implements the OA2_CRED_INHERIT flag for openat2() syscall. It is needed to perform an open operation with the creds that were in effect when the dir_fd was opened, if the dir was opened with O_CRED_ALLOW flag. This allows the process to pre-open some dirs and switch eUID (and other UIDs/GIDs) to the less-privileged user, while still retaining the possibility to open/create files within the pre-opened directory set.

I’ve been contemplating this, and I want to propose a different solution.

First, the problem Stas is solving is quite narrow and doesn’t actually need kernel support: if I want to write a user program that sandboxes itself, I have at least three solutions already. I can make a userns and a mountns; I can use landlock; and I can have a separate process that brokers filesystem access using SCM_RIGHTS.

But what if I want to run a container, where the container can access a specific host directory, and the contained application is not aware of the exact technology being used? I recently started using containers in anger in a production setting, and “anger” was definitely the right word: binding part of a filesystem in is *miserable*. Getting the DAC rules right is nasty. LSMs are worse. Podman’s “bind,relabel” feature is IMO utterly disgusting. I think I actually gave up on making one of my use cases work on a Fedora system.

Here’s what I wanted to do, logically, in production: pick a host directory, pick a host *principal* (UID, GID, label, etc), and have the *entire container* access the directory as that principal. This is what happens automatically if I run the whole container as a userns with only a single UID mapped, but I don’t really want to do that for a whole variety of reasons.

So maybe reimagining Stas’ feature a bit can actually solve this problem.
Instead of a special dirfd, what if there was a special subtree (in the sense of open_tree) that captures a set of creds and does all opens inside the subtree using those creds? This isn’t a fully formed proposal, but I *think* it should be generally fairly safe for even an unprivileged user to clone a subtree with a specific flag set to do this. Maybe a capability would be needed (CAP_CAPTURE_CREDS?), but it would be nice to allow delegating this to a daemon if a privilege is needed, and getting the API right might be a bit tricky.

Tying this to an actual mount rather than a file handle sounds like a more plausible proposal than OA2_CRED_INHERIT, but it just seems that this is going to re-create all of the work that went into id-mapped mounts but with the extra-special step of making the generic VFS permissions no longer work normally (unless the idea is that everything would pretend to be owned by current_fsuid()?).
I was assuming that the owner uid and gid would be shown to stat, etc. as usual. But the permission checks would be done against the captured creds.
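The semantics described here (stat reports the real owner, while permission checks use the creds captured by the subtree) can be sketched as a toy model. This is only an illustration of the idea, not kernel code: the names `Cred`, `Inode`, `dac_allows`, `subtree_allows` and `subtree_stat` are hypothetical, and the model deliberately ignores capabilities, ACLs and LSMs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cred:
    uid: int
    gid: int
    groups: frozenset = frozenset()


@dataclass(frozen=True)
class Inode:
    uid: int
    gid: int
    mode: int  # permission bits only, e.g. 0o700


def dac_allows(inode, cred, want):
    """Classic owner/group/other check; `want` is a mask of 4 (r), 2 (w), 1 (x)."""
    if cred.uid == inode.uid:
        bits = (inode.mode >> 6) & 7
    elif cred.gid == inode.gid or inode.gid in cred.groups:
        bits = (inode.mode >> 3) & 7
    else:
        bits = inode.mode & 7
    return (bits & want) == want


def subtree_allows(inode, caller, captured, want):
    """Hypothetical cred-capturing subtree: the caller's own creds are
    ignored and the check runs against the creds captured at clone time."""
    return dac_allows(inode, captured, want)


def subtree_stat(inode):
    """Ownership is reported unchanged; only the permission check differs."""
    return inode.uid, inode.gid
```

With a 0700 directory owned by uid 1000, a caller running as uid 1 fails `dac_allows` but passes `subtree_allows` when the captured creds are uid 1000, and `subtree_stat` still reports uid 1000 as the owner.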
IMHO it also isn't enough to just make open work, you need to make all operations work (which leads to a non-trivial amount of filesystem-specific handling), which is just idmapped mounts. A lot of work was put into making sure that is safe, and collapsing owners seems like it will cause a lot of headaches. I also find it somewhat amusing that this proposal is to basically give up on multi-user permissions for this one directory tree because it's too annoying to deal with. In that case, isn't chmod 777 a simpler solution? (I'm being a bit flippant, of course there is a difference, but the net result is that all users in the container would have the same permissions with all of the fun issues that implies.) In short, AFAICS idmapped mounts pretty much solve this problem (minus the ability to collapse users, which I suspect is not a good idea in general)?
With my kernel hat on, maybe I agree. But with my *user* hat on, I think I pretty strongly disagree. Look, idmap is lousy for unprivileged use:

    $ install -m 0700 -d test_directory
    $ echo 'hi there' >test_directory/file
    $ podman run -it --rm --mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim]
    # cat /tmp/file
    hi there

<-- Hey, look, this kind of works!

    # setpriv --reuid=1 ls /tmp
    ls: cannot open directory '/tmp': Permission denied

<-- Gee, thanks, Linux!

Obviously this is a made up example. But it's quite analogous to a real example. Suppose I want to make a directory that will contain some MySQL data. I don't want to share this directory with anyone else, so I set its mode to 0700. Then I want to fire up an unprivileged MySQL container, so I build or download it, and then I bind my directory to /var/lib/mysql and I run it.

I don't need to think about UIDs or anything because it's 2024 and containers just work. Okay, I need to setenforce 0 because I'm on Fedora and SELinux makes absolutely no sense in a container world, but I can live with that.

Except that it doesn't work! Because unless I want to manually futz with the idmaps to get mysql to have access to the directory inside the container, only *root* gets to get in. But I bet that even futzing with the idmap doesn't work, because software like mysql often expects that root *and* a user can access data. And some software even does privilege separation and uses more than one UID.

So I want a way to give *an entire container* access to a directory. Classic UNIX DAC is just *wrong* for this use case. Maybe idmaps could learn a way to squash multiple ids down to one. Or maybe something like my silly credential-capturing mount proposal could work. But the status quo is not actually amazing IMO.

I haven't looked at the idmap implementation nearly enough to have any opinion as to whether squashing UID is practical or whether there's any sensible way to specify it in the configuration.
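The failure in the session above, and the "squash multiple ids down to one" idea, can be illustrated with a toy model. It is a sketch under loud assumptions: the helper names are hypothetical, the mapping table uses the three-column layout of /proc/PID/uid_map, the concrete numbers (host owner 1000, subordinate range starting at 100000) are made up, the userns and mount mappings are collapsed into one table, and group bits, capabilities and LSMs are ignored.

```python
def map_uid(idmap, container_uid):
    """Translate a container UID to a host UID.

    `idmap` is a list of (container_start, host_start, count) ranges.
    Returns None when the UID is not mapped at all.
    """
    for c_start, h_start, count in idmap:
        if c_start <= container_uid < c_start + count:
            return h_start + (container_uid - c_start)
    return None


def squash_uid(idmap_target, container_uid):
    """Hypothetical squashing map: every container UID becomes one host UID."""
    return idmap_target


def owner_or_other_allows(dir_owner, mode, host_uid, want):
    """Owner/other DAC check only; `want` is a mask of 4 (r), 2 (w), 1 (x)."""
    bits = (mode >> 6) & 7 if host_uid == dir_owner else mode & 7
    return (bits & want) == want


# Assumed setup: a 0700 directory owned by host UID 1000, container root
# mapped to that UID, every other container UID mapped into a subordinate range.
DIR_OWNER, DIR_MODE = 1000, 0o700
IDMAP = [(0, 1000, 1), (1, 100000, 65536)]

root_ok = owner_or_other_allows(DIR_OWNER, DIR_MODE, map_uid(IDMAP, 0), 4)
uid1_ok = owner_or_other_allows(DIR_OWNER, DIR_MODE, map_uid(IDMAP, 1), 4)
uid1_squashed_ok = owner_or_other_allows(
    DIR_OWNER, DIR_MODE, squash_uid(DIR_OWNER, 1), 4
)
```

In this model container root can read the directory, container UID 1 cannot, and the same UID 1 can once every container UID is squashed to the directory owner.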
On Apr 29, 2024, at 2:12 AM, Christian Brauner [off-list ref] wrote:
Nowadays it's extremely simple due to open_tree(OPEN_TREE_CLONE) and move_mount(). I rewrote the bind-mount logic in systemd based on that and util-linux uses that as well now. https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html
Yep, I remember that.
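For readers who have not used it, the open_tree(OPEN_TREE_CLONE) plus move_mount() pattern mentioned above looks roughly like the following. This is a minimal sketch, not the systemd or util-linux code: `bind_mount` is a hypothetical helper, the syscall numbers and flag values are copied from the Linux UAPI headers and assumed to match the running architecture (they do on x86_64 and arm64), and the call needs CAP_SYS_ADMIN over the current mount namespace.

```python
import ctypes
import os

# Assumed values from linux/mount.h, linux/fcntl.h and the unified syscall table.
SYS_open_tree = 428
SYS_move_mount = 429
AT_FDCWD = -100
AT_RECURSIVE = 0x8000
OPEN_TREE_CLONE = 0x1
OPEN_TREE_CLOEXEC = os.O_CLOEXEC
MOVE_MOUNT_F_EMPTY_PATH = 0x4

_libc = ctypes.CDLL(None, use_errno=True)


def _syscall(nr, *args):
    """Invoke a raw syscall and turn a negative return into OSError."""
    ret = _libc.syscall(ctypes.c_long(nr), *args)
    if ret < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return ret


def bind_mount(source, target, recursive=True):
    """Clone the mount tree at `source` and attach the copy at `target`."""
    flags = OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC
    if recursive:
        flags |= AT_RECURSIVE
    # Step 1: get a detached copy of the subtree as a file descriptor.
    tree_fd = _syscall(
        SYS_open_tree,
        ctypes.c_int(AT_FDCWD),
        ctypes.c_char_p(os.fsencode(source)),
        ctypes.c_uint(flags),
    )
    try:
        # Step 2: attach the detached tree, addressed by fd with an empty path.
        _syscall(
            SYS_move_mount,
            ctypes.c_int(tree_fd),
            ctypes.c_char_p(b""),
            ctypes.c_int(AT_FDCWD),
            ctypes.c_char_p(os.fsencode(target)),
            ctypes.c_uint(MOVE_MOUNT_F_EMPTY_PATH),
        )
    finally:
        os.close(tree_fd)
```

The detached tree exists only as a file descriptor between the two calls, which is what makes it possible to hand it to another process or attach it in a different mount namespace.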
Podman’s “bind,relabel” feature is IMO utterly disgusting. I think I actually gave up on making one of my use cases work on a Fedora system. Here’s what I wanted to do, logically, in production: pick a host directory, pick a host *principal* (UID, GID, label, etc), and have the *entire container* access the directory as that principal. This is what happens automatically if I run the whole container as a userns with only a single UID mapped, but I don’t really want to do that for a whole variety of reasons.

You're describing idmapped mounts for the most part which are upstream and are used in exactly that way by a lot of userspace.
See above...
So maybe reimagining Stas’ feature a bit can actually solve this problem. Instead of a special dirfd, what if there was a special subtree (in the sense of open_tree) that captures a set of creds and does all opens inside the subtree using those creds?

That would mean override creds in the VFS layer when accessing a specific subtree which is a terrible idea imho. Not just because it will quickly become a potential DoS when you do that with a lot of subtrees; it will also have complex interactions with overlayfs.
I was deliberately talking about semantics, not implementation. This may well be impossible to implement straightforwardly.