Thread (17 messages) 17 messages, 7 authors, 2023-10-11

Re: [PATCH v8 1/1] ns: add binfmt_misc to the user namespace

From: Jann Horn <jannh@google.com>
Date: 2019-12-16 19:09:26
Also in: linux-fsdevel, lkml

On Mon, Dec 16, 2019 at 10:12 AM Laurent Vivier [off-list ref] wrote:
This patch allows to have a different binfmt_misc configuration
for each new user namespace. By default, the binfmt_misc configuration
is the one of the previous level, but if the binfmt_misc filesystem is
mounted in the new namespace a new empty binfmt instance is created and
used in this namespace.

For instance, using "unshare" we can start a chroot of another
architecture and configure the binfmt_misc interpreter without being root
to run the binaries in this chroot.
How do you ensure that when userspace is no longer using the user
namespace and mount namespace, the entries and the binfmt_misc
superblock are deleted? As far as I can tell from looking at the code,
at the moment, if I create a user namespace+mount namespace, mount
binfmt_misc in there, register a file format and then let all
processes inside the namespaces exit, the binfmt_misc mount will be
kept alive by the simple_pin_fs() stuff, and the binfmt_misc entries
will also stay in memory.

[...]
quoted hunk ↗ jump to hunk
@@ -718,7 +736,9 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
        if (!inode)
                goto out2;

-       err = simple_pin_fs(&bm_fs_type, &bm_mnt, &entry_count);
+       ns = binfmt_ns(file_dentry(file)->d_sb->s_user_ns);
+       err = simple_pin_fs(&bm_fs_type, &ns->bm_mnt,
+                           &ns->entry_count);
When you call simple_pin_fs() here, the user namespace of `current`
and the user namespace of the superblock are not necessarily related.
So simple_pin_fs() may end up taking a reference on the mountpoint for
a user namespace that has nothing to do with the namespace for which
an entry is being created.

[...]
 static int bm_fill_super(struct super_block *sb, struct fs_context *fc)
 {
        int err;
+       struct user_namespace *ns = sb->s_user_ns;
[...]
+       /* create a new binfmt namespace
+        * if we are not in the first user namespace
+        * but the binfmt namespace is the first one
+        */
+       if (READ_ONCE(ns->binfmt_ns) == NULL) {
The READ_ONCE() here is unnecessary, right? AFAIK the VFS layer is
going to ensure that bm_fill_super() can't run concurrently for the
same namespace?
+               struct binfmt_namespace *new_ns;
+
+               new_ns = kmalloc(sizeof(struct binfmt_namespace),
+                                GFP_KERNEL);
+               if (new_ns == NULL)
+                       return -ENOMEM;
+               INIT_LIST_HEAD(&new_ns->entries);
+               new_ns->enabled = 1;
+               rwlock_init(&new_ns->entries_lock);
+               new_ns->bm_mnt = NULL;
+               new_ns->entry_count = 0;
+               /* ensure new_ns is completely initialized before sharing it */
+               smp_wmb();
+               WRITE_ONCE(ns->binfmt_ns, new_ns);
Nit: This would be a little bit semantically clearer if you used
smp_store_release() instead of smp_wmb()+WRITE_ONCE().
+       }
+
        err = simple_fill_super(sb, BINFMTFS_MAGIC, bm_files);
[...]
+static void bm_free(struct fs_context *fc)
+{
+       if (fc->s_fs_info)
+               put_user_ns(fc->s_fs_info);
+}
Silly question: Why the "if"? Can you ever reach this with fc->s_fs_info==NULL?
+
 static int bm_get_tree(struct fs_context *fc)
 {
-       return get_tree_single(fc, bm_fill_super);
+       return get_tree_keyed(fc, bm_fill_super, get_user_ns(fc->user_ns));
get_user_ns() increments the refcount of the namespace, but in the
case where a binfmt_misc mount already exists, that refcount is never
dropped, right? That would be a security bug, since an attacker could
overflow the refcount of the user namespace and then trigger a UAF.
(And the refcount hardening won't catch it because user namespaces
still use raw atomics instead of refcount_t.)

[...]
+#if IS_ENABLED(CONFIG_BINFMT_MISC)
Nit: Isn't this kind of check normally written as "#ifdef"?
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help