Thread (71 messages) 71 messages, 7 authors, 2025-09-19

Re: [PATCH v2 28/33] nsfs: support file handles

From: Christian Brauner <brauner@kernel.org>
Date: 2025-09-15 13:55:13
Also in: cgroups, linux-block, linux-fsdevel, linux-kselftest, linux-nfs, lkml

On Mon, Sep 15, 2025 at 03:25:20PM +0200, Jan Kara wrote:
On Fri 12-09-25 13:52:51, Christian Brauner wrote:
quoted
A while ago we added support for file handles to pidfs so pidfds can be
encoded and decoded as file handles. Userspace has adopted this quickly
and it's proven very useful. Implement file handles for namespaces as
well.

A process is not always able to open /proc/self/ns/. That requires
procfs to be mounted and for /proc/self/ or /proc/self/ns/ to not be
overmounted. However, userspace can always derive a namespace fd from
a pidfd. And that always works for a task's own namespace.

There's no need to introduce unnecessary behavioral differences between
/proc/self/ns/ fds, pidfd-derived namespace fds, and file-handle-derived
namespace fds. So namespace file handles are always decodable if the
caller is located in the namespace the file handle refers to.

This also allows a task to e.g., store a set of file handles to its
namespaces in a file on-disk so it can verify when it gets rexeced that
they're still valid and so on. This is akin to the pidfd use-case.

Or just plainly for namespace comparison reasons where a file handle to
the task's own namespace can be easily compared against others.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
...
quoted
+	switch (ns->ops->type) {
+#ifdef CONFIG_CGROUPS
+	case CLONE_NEWCGROUP:
+		if (!current_in_namespace(to_cg_ns(ns)))
+			owning_ns = to_cg_ns(ns)->user_ns;
+		break;
+#endif
+#ifdef CONFIG_IPC_NS
+	case CLONE_NEWIPC:
+		if (!current_in_namespace(to_ipc_ns(ns)))
+			owning_ns = to_ipc_ns(ns)->user_ns;
+		break;
+#endif
+	case CLONE_NEWNS:
+		if (!current_in_namespace(to_mnt_ns(ns)))
+			owning_ns = to_mnt_ns(ns)->user_ns;
+		break;
+#ifdef CONFIG_NET_NS
+	case CLONE_NEWNET:
+		if (!current_in_namespace(to_net_ns(ns)))
+			owning_ns = to_net_ns(ns)->user_ns;
+		break;
+#endif
+#ifdef CONFIG_PID_NS
+	case CLONE_NEWPID:
+		if (!current_in_namespace(to_pid_ns(ns))) {
+			owning_ns = to_pid_ns(ns)->user_ns;
+		} else if (!READ_ONCE(to_pid_ns(ns)->child_reaper)) {
+			ns->ops->put(ns);
+			return ERR_PTR(-EPERM);
+		}
+		break;
+#endif
+#ifdef CONFIG_TIME_NS
+	case CLONE_NEWTIME:
+		if (!current_in_namespace(to_time_ns(ns)))
+			owning_ns = to_time_ns(ns)->user_ns;
+		break;
+#endif
+#ifdef CONFIG_USER_NS
+	case CLONE_NEWUSER:
+		if (!current_in_namespace(to_user_ns(ns)))
+			owning_ns = to_user_ns(ns);
+		break;
+#endif
+#ifdef CONFIG_UTS_NS
+	case CLONE_NEWUTS:
+		if (!current_in_namespace(to_uts_ns(ns)))
+			owning_ns = to_uts_ns(ns)->user_ns;
+		break;
+#endif
Frankly, switches like these are asking for more Generic usage ;) But ok
for now.
quoted
+	default:
+		return ERR_PTR(-EOPNOTSUPP);
+	}
+
+	if (owning_ns && !ns_capable(owning_ns, CAP_SYS_ADMIN)) {
+		ns->ops->put(ns);
+		return ERR_PTR(-EPERM);
+	}
+
+	/* path_from_stashed() unconditionally consumes the reference. */
+	ret = path_from_stashed(&ns->stashed, nsfs_mnt, ns, &path);
+	if (ret)
+		return ERR_PTR(ret);
+
+	return no_free_ptr(path.dentry);
Ugh, so IMO this is very subtle because we declare

	struct path path __free(path_put)

but then do no_free_ptr(path.dentry). I really had to lookup implementation
of no_free_ptr() to check whether we are leaking mnt reference here or not
(we are not). But that seems as an implementation detail we shouldn't
better rely on? Wouldn't be:

	return dget(path.dentry);

much clearer (and sligthly less efficient, I know, but who cares)?
Fine by me as well!
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help