Re: [PATCH v6 5/6] binfmt_*: scope path resolution of interpreters
From: Andy Lutomirski <luto@kernel.org>
Date: 2019-05-11 22:40:02
Also in:
linux-arch, linux-fsdevel, lkml
On May 11, 2019, at 10:21 AM, Linus Torvalds [off-list ref] wrote:
> On Sat, May 11, 2019 at 1:00 PM Andy Lutomirski [off-list ref] wrote:
>> A better “spawn” API should fix this.
>
> Andy, stop with the "spawn would be better".
It doesn’t have to be spawn per se. But the current situation sucks.
> Notice? None of the real problems are about execve or would be solved by any spawn API. You just think that because you've apparently been talking to too many MS people that think fork (and thus indirectly execve()) is bad process management.
I’ve literally never spoken to an MS person about it. What container managers and init systems *want* is a way to drop privileges, change namespaces, etc. and then run something in a controlled way so that the intermediate states aren’t dangerous. An API for this could be spawn-like or exec-like; that particular distinction is beside the point.

Having personally written code that mucks with namespaces, I've wanted two particular abilities that are both quite awkward:

a) Change all my UIDs and GIDs to match a container, enter that container's namespaces, and run some binary in the container's filesystem, all atomically enough that I don't need to worry about accidentally leaking privileges into the container. A super-duper-non-dumpable mode would kind of allow this, but I'd worry that there's some other hole besides ptrace() and /proc/self.

b) Change all my UIDs and GIDs to match a container, enter that container's namespaces, and run some binary that is *not* in the container's filesystem. This happens, for example, if the container's mount namespace has no exec mounts at all. We don't have a good way to do this right now due to /proc/self/exe.

Regardless, the actual CVE at hand would have been nicely avoided if writing to /proc/self/exe didn’t work, and I see no reason we can’t make that happen.

I suppose we could also consider a change to disable /proc/self/exe if it's not reachable from /proc/self/root. By "disable", I mean that readlink() should maybe still work, but actually trying to open it could probably fail safely.
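
To make the readlink()-versus-open distinction concrete, here is a minimal userspace sketch of how the two operations on /proc/self/exe can be probed separately. It shows current behaviour only, not the proposed change. The helper name probe_self_exe is made up for illustration, and the exact errno depends on the environment: a direct write open of a running binary typically fails with ETXTBSY when the caller has write permission on the file and with EACCES otherwise, but that is an assumption about the system it runs on.

```python
import errno
import os


def probe_self_exe(path="/proc/self/exe"):
    """Return (link_target, error_name) for the running executable.

    error_name is the symbolic errno from a write-only open attempt,
    or None if that open unexpectedly succeeded. Hypothetical helper,
    for illustration only.
    """
    # Resolving the symlink only reveals a path; it grants no access.
    target = os.readlink(path)
    try:
        # O_WRONLY without O_TRUNC: even if this succeeds, nothing is modified.
        fd = os.open(path, os.O_WRONLY)
    except OSError as e:
        return target, errno.errorcode.get(e.errno, str(e.errno))
    os.close(fd)
    return target, None


if __name__ == "__main__":
    target, err = probe_self_exe()
    print("readlink:", target)
    print("open(O_WRONLY):", err or "succeeded")
```

Run from a Python interpreter, the link target is the interpreter binary itself. The sketch only covers a direct write open while the binary is executing; it says nothing about descriptors opened read-only and reopened later.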