Re: [PATCH v8 1/1] ns: add binfmt_misc to the user namespace
From: Jann Horn <jannh@google.com>
Date: 2019-12-16 22:54:26
Also in:
linux-fsdevel, lkml
On Mon, Dec 16, 2019 at 9:05 PM Laurent Vivier [off-list ref] wrote:
Le 16/12/2019 à 20:08, Jann Horn a écrit :quoted
On Mon, Dec 16, 2019 at 10:12 AM Laurent Vivier [off-list ref] wrote:quoted
This patch allows to have a different binfmt_misc configuration for each new user namespace. By default, the binfmt_misc configuration is the one of the previous level, but if the binfmt_misc filesystem is mounted in the new namespace a new empty binfmt instance is created and used in this namespace. For instance, using "unshare" we can start a chroot of another architecture and configure the binfmt_misc interpreter without being root to run the binaries in this chroot.How do you ensure that when userspace is no longer using the user namespace and mount namespace, the entries and the binfmt_misc superblock are deleted? As far as I can tell from looking at the code, at the moment, if I create a user namespace+mount namespace, mount binfmt_misc in there, register a file format and then let all processes inside the namespaces exit, the binfmt_misc mount will be kept alive by the simple_pin_fs() stuff, and the binfmt_misc entries will also stay in memory. [...]Do you have any idea how I can fix this issue?
I think the easiest way (keeping in mind that we want to avoid having to fiddle around with reference loops, where e.g. interpreter executable files opened by binfmt_misc have references back to the user namespace through ->f_cred) would be to add a new patch in front of this one that changes the semantics such that when binfmt_misc is unmounted, all the existing format registrations are deleted. That's probably also nicer from the perspective of inspectability. It could in theory break stuff, but I think that's probably somewhat unlikely. Still, it'd be an API change, and therefore you should CC linux-api@ on such a change.
quoted
quoted
@@ -718,7 +736,9 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer, if (!inode) goto out2; - err = simple_pin_fs(&bm_fs_type, &bm_mnt, &entry_count); + ns = binfmt_ns(file_dentry(file)->d_sb->s_user_ns); + err = simple_pin_fs(&bm_fs_type, &ns->bm_mnt, + &ns->entry_count);When you call simple_pin_fs() here, the user namespace of `current` and the user namespace of the superblock are not necessarily related. So simple_pin_fs() may end up taking a reference on the mountpoint for a user namespace that has nothing to do with the namespace for which an entry is being created.Do you have any idea how I can fix this issue?
If you fix the memory leak the way I suggested, this wouldn't be a problem anymore either.
quoted
quoted
+static void bm_free(struct fs_context *fc) +{ + if (fc->s_fs_info) + put_user_ns(fc->s_fs_info); +}Silly question: Why the "if"? Can you ever reach this with fc->s_fs_info==NULL?So I understand the if is unnecessary and I will remove it.
Your code was actually exactly right, I didn't understand how fc->s_fs_info works.
quoted
quoted
static int bm_get_tree(struct fs_context *fc) { - return get_tree_single(fc, bm_fill_super); + return get_tree_keyed(fc, bm_fill_super, get_user_ns(fc->user_ns));get_user_ns() increments the refcount of the namespace, but in the case where a binfmt_misc mount already exists, that refcount is never dropped, right? That would be a security bug, since an attacker could overflow the refcount of the user namespace and then trigger a UAF. (And the refcount hardening won't catch it because user namespaces still use raw atomics instead of refcount_t.)Do you have any idea how I can fix this issue?
Ah, this was actually fine. I missed that get_tree_keyed() stashes that pointer in fc->s_fs_info.
quoted
quoted
+#if IS_ENABLED(CONFIG_BINFMT_MISC)Nit: Isn't this kind of check normally written as "#ifdef"?What is the difference?
As explained in Documentation/process/coding-style.rst and the relevant header, IS_ENABLED() is for inline use in C expressions.