Re: [RFC] Null Namespaces

From: Christian Brauner <brauner@kernel.org>
Date: 2026-06-29 11:45:32
Also in: linux-arch, linux-fsdevel, lkml
Subsystem: file locking (flock() and fcntl()/lockf()), the rest · Maintainers: Jeff Layton, Chuck Lever, Linus Torvalds

On Wed, Jun 24, 2026 at 06:51:47PM -0400, John Ericson wrote:
> Hello, I am hoping to discuss an idea I've had for a while, that I am
> calling "null namespaces" that has become more relevant with some recent
> other discussions. First I'll discuss null namespaces in general terms,
> and then I'll link those recent discussions and relate null namespaces
> to them.
>
> ### Null namespaces
>
> The essence of null namespaces is trying to give processes as little
> ambient authority as possible, so they are lighter weight and allowed to
> do even less than fully unshared processes today.
>
> Namespaces as they exist today are frequently described as an isolation
> mechanism, but I think this is the conflation of two different things.
> *Removing* a new process from its parent's namespaces unquestionably is
> increasing isolation --- no disagreement there. But putting the process
> in new namespaces is something else; I would call it supporting
> "delusions of grandeur" of that process. For example, namespaces allow a
> process to do mounts, have `CAP_SYS_ADMIN`, create network interfaces,
> look up other processes by PID, etc.
>
> Conceptually, to remove a process from one ambient authority scope (the
> very name "namespaces" indicates they are about ambient authority)
> should not require putting it in some ambient authority scope. Just
> because, for example, the process cannot see one mount tree, doesn't
> mean it needs to see another.
>
> Here's what I am thinking would happen concretely:
>
> First, the simpler cases:
>
> #### Null mount namespace
>
> - requires:
>
>   - null root file system: absolute paths don't work.
>
>   - null current working directory: relative paths with traditional,
>     non-`*at` system calls (and `*at` ones using `AT_FDCWD`) don't work.
>
> - All operations relating to the "ambient" mount tree don't work.
>
> - `*at` operations with a file descriptor do work.
>
> - The new fd-based mount APIs with detached mounts do work, modulo
>   the calling process having enough permissions (as usual).

Nothing here requires you to NULL anything and I oppose this on code
sanity reasons alone. We should absolutely not start to stash any NULL
pointers in core kernel objects such as struct path that are used
everywhere.

So I've added nullfs a few releases back. It's currently not mountable
from userspace but I've already mentioned in the commit message that
this is going to change. But I also added:

unshare(UNSHARE_EMPTY_MNTNS)
clone3(CLONE_EMPTY_MNTNS)

In both cases the process is placed into a completely empty mount
namespace with nullfs as its root and cwd. If you're in a new mount
namespace with CAP_SYS_ADMIN thrown away it means you're going to be in
nullfs forever.
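
A process confined like this would presumably still reach files through
directory descriptors it opened (or was handed) beforehand, which is the
"`*at` operations with a file descriptor do work" point from the quoted
proposal. A minimal sketch of that fd-relative pattern follows. It runs on
current kernels and does not use the new flags; the helper name
roundtrip_at() is made up for illustration.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write `msg` to the plain file name `name` inside the directory referred
 * to by `dirfd`, then read it back into `buf`. Every lookup starts at
 * `dirfd`, so for a plain relative name the caller's cwd is not used.
 * Returns the number of bytes read back, or -1 on error.
 */
static ssize_t roundtrip_at(int dirfd, const char *name,
                            const char *msg, char *buf, size_t buflen)
{
        ssize_t n;
        int fd;

        fd = openat(dirfd, name, O_CREAT | O_TRUNC | O_WRONLY | O_CLOEXEC, 0600);
        if (fd < 0)
                return -1;
        n = write(fd, msg, strlen(msg));
        close(fd);
        if (n != (ssize_t)strlen(msg))
                return -1;

        fd = openat(dirfd, name, O_RDONLY | O_CLOEXEC);
        if (fd < 0)
                return -1;
        n = read(fd, buf, buflen);
        close(fd);

        /* Clean up, again relative to the directory descriptor. */
        unlinkat(dirfd, name, 0);
        return n;
}
```

The caller would open the directory descriptor before giving up its root
and cwd, e.g. open("/some/dir", O_RDONLY | O_DIRECTORY | O_CLOEXEC), and
pass it to roundtrip_at() afterwards.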

It's possible we can come up with:

unshare(UNSHARE_FS_EMPTY)
clone3(CLONE_FS_EMPTY)

which just moves the task into an isolated nullfs instance (it would
need some thinking about interactions with chroot()).

But I guess the even simpler model would be to copy what I've been doing
for pidfs:

+static struct path nullfs_root_path = {};
+
+void nullfs_get_root(struct path *path)
+{
+       *path = nullfs_root_path;
+       path_get(path);
+}
+
 static void __init init_mount_tree(void)
 {
        struct vfsmount *mnt, *nullfs_mnt;
@@ -6209,6 +6217,8 @@ static void __init init_mount_tree(void)
        /* Mount mutable rootfs on top of nullfs. */
        root.mnt                = nullfs_mnt;
        root.dentry             = nullfs_mnt->mnt_root;
+       nullfs_root_path.mnt    = nullfs_mnt;
+       nullfs_root_path.dentry = nullfs_mnt->mnt_root;

        LOCK_MOUNT_EXACT(mp, &root);
        if (unlikely(IS_ERR(mp.parent)))
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index aadfbf6e0cb3..f55c87c70b78 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -124,6 +124,7 @@ struct delegation {

 #define FD_PIDFS_ROOT                  -10002 /* Root of the pidfs filesystem */
 #define FD_NSFS_ROOT                   -10003 /* Root of the nsfs filesystem */
+#define FD_NULLFS_ROOT                 -10004 /* Root of the nullfs filesystem */
 #define FD_INVALID                     -10009 /* Invalid file descriptor: -10000 - EBADF = -10009 */

 /* Generic flags for the *at(2) family of syscalls. */

We then add fchroot() (overdue anyway) and teach both fchdir() and
fchroot() to honor FD_NULLFS_ROOT. A process may then shed its fs state
and move itself into nullfs. Restrict *chdir() and *chroot() for said
process via seccomp and it's locked in forever as well.
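
The seccomp half of that can be sketched with today's interfaces. The
filter below makes chdir(), fchdir() and chroot() fail with EPERM. It is
only an illustration under some assumptions: fchroot() is a proposal here
and has no syscall number to filter yet, the helper names are made up, and
only x86-64 and arm64 are covered. A production filter on x86-64 would
also reject x32 syscall numbers.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
#define SKETCH_AUDIT_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define SKETCH_AUDIT_ARCH AUDIT_ARCH_AARCH64
#else
#error "add the AUDIT_ARCH_* value for this architecture"
#endif

/* Install a filter that fails chdir(), fchdir() and chroot() with EPERM. */
static int deny_fs_state_changes(void)
{
        struct sock_filter filter[] = {
                /* Kill the process if the syscall ABI is not the native one. */
                BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
                BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SKETCH_AUDIT_ARCH, 1, 0),
                BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
                /* Load the syscall number and compare it to the denied set. */
                BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
                BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_chdir, 3, 0),
                BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_fchdir, 2, 0),
                BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_chroot, 1, 0),
                BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
                BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        };
        struct sock_fprog prog = {
                .len = sizeof(filter) / sizeof(filter[0]),
                .filter = filter,
        };

        /* Required so an unprivileged task may install a filter. */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
                return -1;
        return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/*
 * Install the filter in a child so it never affects the caller.
 * Returns 1 if chdir() failed with EPERM there, 0 if not, -1 on error.
 */
static int chdir_is_blocked_in_child(void)
{
        int status;
        pid_t pid = fork();

        if (pid < 0)
                return -1;
        if (pid == 0) {
                if (deny_fs_state_changes() != 0)
                        _exit(2);
                _exit(chdir("/") == -1 && errno == EPERM ? 0 : 1);
        }
        if (waitpid(pid, &status, 0) != pid)
                return -1;
        return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

In the scheme described above the filter would be installed after the
process has moved itself into nullfs, since seccomp filters cannot be
removed once loaded.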