Re: [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE
From: Andy Lutomirski <luto@amacapital.net>
Date: 2026-01-21 18:00:33
Also in: linux-fsdevel, lkml
On Jan 19, 2026, at 2:21 PM, Jeff Layton [off-list ref] wrote:
> On Mon, 2026-01-19 at 11:05 -0800, Andy Lutomirski wrote:
>> On Mon, Jan 19, 2026 at 10:56 AM Askar Safin [off-list ref] wrote:
>>> Christian Brauner [off-list ref]:
>>>> Extend open_tree() with a new OPEN_TREE_NAMESPACE flag. Similar to OPEN_TREE_CLONE, only the indicated mount tree is copied. Instead of returning a file descriptor referring to that mount tree, OPEN_TREE_NAMESPACE will cause open_tree() to return a file descriptor to a new mount namespace. In that new mount namespace the copied mount tree has been mounted on top of a copy of the real rootfs.
>>>
>>> I want to point at security benefits of this.
>>>
>>> [[ TL;DR: [1] and [2] are very big changes to how mount namespaces work. I like them, and I think they should get wider exposure. ]]
>>>
>>> If this patchset ([1]) and [2] both land (they are both in "next" now and likely will be submitted to mainline soon) and "nullfs_rootfs" is passed on the command line, then the mount namespace created by open_tree(OPEN_TREE_NAMESPACE) will usually contain exactly 2 mounts: nullfs and whatever was passed to open_tree(OPEN_TREE_NAMESPACE).
>>>
>>> This means that even if an attacker is somehow able to unmount its root and get access to underlying mounts, the only underlying thing they will get is nullfs.
>>>
>>> Also this means that other mounts are not only hidden in the new namespace, they are fully absent. This prevents attacks discussed here: [3], [4].
>>>
>>> Also this means that (assuming we have both [1] and [2] and "nullfs_rootfs" is passed), there is no longer a hidden writable mount shared by all containers, potentially available to attackers. This is the concern raised in [5]:
>>>
>>>> You want rootfs to be a NULLFS instead of ramfs. You don't seem to want it to actually _be_ a filesystem. Even with your "fix", containers could communicate with each _other_ through it if it becomes accessible. If a container can get access to an empty initramfs and write into it, it can ask/answer the question "Are there any other containers on this machine running stux24" and then coordinate.
>>
>> I think this new OPEN_TREE_NAMESPACE is nifty, but I don't think the path that gives it sensible behavior should be conditional like this. Either make it *always* mount on top of nullfs (regardless of boot options) or find some way to have it actually be the root. I assume the latter is challenging for some reason.
>
> I think that's the plan. I suggested the same to Christian last week, and he was amenable to removing the option and just always doing a nullfs_rootfs mount. We think that older runtimes should still "just work" with this scheme. Out of an abundance of caution, we _might_ want a command-line option to make it go back to the old way, in case we find some userland stuff that doesn't like this for some reason, but hopefully we won't even need that.

What I mean is: even if for some reason the kernel is running in a mode where the *initial* rootfs is a real fs, I think it would be nice for OPEN_TREE_NAMESPACE to use nullfs.