Re: [PATCH v6 5/6] binfmt_*: scope path resolution of interpreters
From: Aleksa Sarai <hidden>
Date: 2019-05-11 17:26:45
Also in:
linux-arch, linux-fsdevel, lkml
On 2019-05-11, Andy Lutomirski [off-list ref] wrote:
I've lost track of the context here, but it seems to me that mitigating attacks involving accidental following of /proc links shouldn't depend on dumpability. What's the actual problem this is trying to solve again?

The one actual security problem that I've seen related to this is CVE-2019-5736. There is a write-up of it at <https://blog.dragonsector.pl/2019/02/cve-2019-5736-escape-from-docker-and.html> under "Successful approach", but it goes more or less as follows:

- A container is running that doesn't use user namespaces (because for some reason I don't understand, apparently some people still do that).
- An evil process is running inside the container with UID 0 (as in, GLOBAL_ROOT_UID); so if the evil process inside the container was able to reach root-owned files on the host filesystem, it could write into them.
- The container engine wants to spawn a new process inside the container. It forks off a child that joins the container's namespaces (including PID and mount namespaces), and then the child calls execve() on some path in the container.

I think that, at this point, the task should be considered owned by the container. Maybe we should have a better API than execve() to execute a program in a safer way, but fiddling with dumpability seems like a band-aid. In fact, the process is arguably pwned even *before* execve.
Yeah, execve is just the vector (though in this case it's done in order to clear mm->dumpable). An earlier CVE (CVE-2016-9962) was very similar but was attacking a dirfd that runc had open into the container (LXC had a very similar bug too) -- setting !mm->dumpable was one of the workarounds we had for this.
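For readers less familiar with the flag being discussed: from userspace, the dumpable state of a process is queried and changed through prctl(2). The following is a minimal sketch, assuming Linux and a libc reachable through ctypes; the helper names are made up for illustration, and this is not how runc or LXC implement their workaround.

```python
import ctypes
import os

# Option numbers from <linux/prctl.h>.
PR_GET_DUMPABLE = 3
PR_SET_DUMPABLE = 4


def _prctl(option: int, arg2: int = 0) -> int:
    """Call prctl(2) from the already-loaded libc (hypothetical helper)."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.prctl.restype = ctypes.c_int
    result = libc.prctl(
        ctypes.c_int(option),
        ctypes.c_ulong(arg2),
        ctypes.c_ulong(0),
        ctypes.c_ulong(0),
        ctypes.c_ulong(0),
    )
    if result < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return result


def get_dumpable() -> int:
    """Return the dumpable state of the calling process (0, 1 or 2)."""
    return _prctl(PR_GET_DUMPABLE)


def set_dumpable(value: int) -> None:
    """Set the dumpable state; prctl(2) accepts only 0 or 1 here."""
    _prctl(PR_SET_DUMPABLE, value)


if __name__ == "__main__":
    print("dumpable:", get_dumpable())
```

A process that has called set_dumpable(0) is harder for other unprivileged processes to inspect through /proc, which is why it served as a stopgap rather than a fix.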
A better “spawn” API should fix this. In the meantime, I think it should be assumed that, if you join a container’s namespaces, you are at its mercy.
This is generally how we treat containers as runtime authors, but it's not a trivial thing to get right. In many cases the kernel APIs are working against you -- Christian and I have written a fair few patches to fix holes in the kernel APIs so we can avoid these kinds of assumptions. But yes, one of the riskiest parts of a container runtime is attaching to a running container, because all of the helpful introspection APIs in /proc/ suddenly become a security nightmare. A better "spawn a process in these namespaces" API might help improve the situation (or at least, I hope it would).
- You can use /proc/*/exe to get a writable fd.

This is IMO the real bug.
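The mechanism behind that point can be shown without touching any real binary. The sketch below assumes Linux with /proc mounted and uses a scratch file as a stand-in for the target: it takes an O_PATH descriptor (which grants no write access by itself) and re-opens it for writing through its /proc/self/fd magic-link. The function names are hypothetical, and this illustrates only the general re-open behaviour, not the exploit from the write-up.

```python
import os
import stat
import tempfile


def reopen_for_write(path_fd: int) -> int:
    """Re-open an existing descriptor for writing via its /proc magic-link."""
    return os.open(f"/proc/self/fd/{path_fd}", os.O_WRONLY)


def link_mode(path: str = "/proc/self/exe") -> str:
    """Return the mode string that lstat(2) reports for a /proc link."""
    return stat.filemode(os.lstat(path).st_mode)


def demo() -> bytes:
    # Harmless stand-in for the target binary: a scratch file we own.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"original")
        target = tmp.name
    try:
        # O_PATH yields a handle that cannot be read from or written to.
        path_fd = os.open(target, os.O_PATH)
        try:
            # Going through /proc performs a fresh open with new access flags.
            write_fd = reopen_for_write(path_fd)
            try:
                os.write(write_fd, b"REPLACED")
            finally:
                os.close(write_fd)
        finally:
            os.close(path_fd)
        with open(target, "rb") as fh:
            return fh.read()
    finally:
        os.unlink(target)


if __name__ == "__main__":
    print(link_mode())
    print(demo())
```

The re-open succeeds because the access check is made against the underlying file at the time of the second open, not against the flags of the original descriptor.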
I will try to send an RFC of the patchset I have for this next week or so. Funnily enough, currently /proc/*/exe has the write bit set in its "mode" (my series fixes this).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
Attachments
- signature.asc [application/pgp-signature] 833 bytes