Re: [PATCH v8 1/1] ns: add binfmt_misc to the user namespace

From: Jann Horn <jannh@google.com>
Date: 2019-12-16 22:54:26
Also in: linux-fsdevel, lkml

On Mon, Dec 16, 2019 at 9:05 PM Laurent Vivier [off-list ref] wrote:

Le 16/12/2019 à 20:08, Jann Horn a écrit :

quoted

On Mon, Dec 16, 2019 at 10:12 AM Laurent Vivier [off-list ref] wrote:

quoted

This patch allows to have a different binfmt_misc configuration
for each new user namespace. By default, the binfmt_misc configuration
is the one of the previous level, but if the binfmt_misc filesystem is
mounted in the new namespace a new empty binfmt instance is created and
used in this namespace.

For instance, using "unshare" we can start a chroot of another
architecture and configure the binfmt_misc interpreter without being root
to run the binaries in this chroot.

How do you ensure that when userspace is no longer using the user
namespace and mount namespace, the entries and the binfmt_misc
superblock are deleted? As far as I can tell from looking at the code,
at the moment, if I create a user namespace+mount namespace, mount
binfmt_misc in there, register a file format and then let all
processes inside the namespaces exit, the binfmt_misc mount will be
kept alive by the simple_pin_fs() stuff, and the binfmt_misc entries
will also stay in memory.

[...]

Do you have any idea how I can fix this issue?

I think the easiest way (keeping in mind that we want to avoid having
to fiddle around with reference loops, where e.g. interpreter
executable files opened by binfmt_misc have references back to the
user namespace through ->f_cred) would be to add a new patch in front
of this one that changes the semantics such that when binfmt_misc is
unmounted, all the existing format registrations are deleted. That's
probably also nicer from the perspective of inspectability. It could
in theory break stuff, but I think that's probably somewhat unlikely.
Still, it'd be an API change, and therefore you should CC linux-api@
on such a change.

quoted

@@ -718,7 +736,9 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
        if (!inode)
                goto out2;

-       err = simple_pin_fs(&bm_fs_type, &bm_mnt, &entry_count);
+       ns = binfmt_ns(file_dentry(file)->d_sb->s_user_ns);
+       err = simple_pin_fs(&bm_fs_type, &ns->bm_mnt,
+                           &ns->entry_count);

When you call simple_pin_fs() here, the user namespace of `current`
and the user namespace of the superblock are not necessarily related.
So simple_pin_fs() may end up taking a reference on the mountpoint for
a user namespace that has nothing to do with the namespace for which
an entry is being created.

Do you have any idea how I can fix this issue?

If you fix the memory leak the way I suggested, this wouldn't be a
problem anymore either.

quoted

+static void bm_free(struct fs_context *fc)
+{
+       if (fc->s_fs_info)
+               put_user_ns(fc->s_fs_info);
+}

Silly question: Why the "if"? Can you ever reach this with fc->s_fs_info==NULL?

So I understand the if is unnecessary and I will remove it.

Your code was actually exactly right, I didn't understand how
fc->s_fs_info works.

quoted

 static int bm_get_tree(struct fs_context *fc)
 {
-       return get_tree_single(fc, bm_fill_super);
+       return get_tree_keyed(fc, bm_fill_super, get_user_ns(fc->user_ns));

get_user_ns() increments the refcount of the namespace, but in the
case where a binfmt_misc mount already exists, that refcount is never
dropped, right? That would be a security bug, since an attacker could
overflow the refcount of the user namespace and then trigger a UAF.
(And the refcount hardening won't catch it because user namespaces
still use raw atomics instead of refcount_t.)

Do you have any idea how I can fix this issue?

Ah, this was actually fine. I missed that get_tree_keyed() stashes
that pointer in fc->s_fs_info.

quoted

+#if IS_ENABLED(CONFIG_BINFMT_MISC)

Nit: Isn't this kind of check normally written as "#ifdef"?

What is the difference?

As explained in Documentation/process/coding-style.rst and the
relevant header, IS_ENABLED() is for inline use in C expressions.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help