Re: [PATCH v5] rust: add PidNamespace

From: Christian Brauner <brauner@kernel.org>
Date: 2024-10-07 12:00:00
Also in: linux-fsdevel, lkml, rust-for-linux

On Fri, Oct 04, 2024 at 02:01:36PM GMT, Alice Ryhl wrote:

On Wed, Oct 2, 2024 at 1:38 PM Christian Brauner [off-list ref] wrote:

quoted

The lifetime of `PidNamespace` is bound to `Task` and `struct pid`.

The `PidNamespace` of a `Task` doesn't ever change once the `Task` is
alive. A `unshare(CLONE_NEWPID)` or `setns(fd_pidns/pidfd, CLONE_NEWPID)`
will not have an effect on the calling `Task`'s pid namespace. It will
only effect the pid namespace of children created by the calling `Task`.
This invariant guarantees that after having acquired a reference to a
`Task`'s pid namespace it will remain unchanged.

When a task has exited and been reaped `release_task()` will be called.
This will set the `PidNamespace` of the task to `NULL`. So retrieving
the `PidNamespace` of a task that is dead will return `NULL`. Note, that
neither holding the RCU lock nor holding a referencing count to the
`Task` will prevent `release_task()` being called.

In order to retrieve the `PidNamespace` of a `Task` the
`task_active_pid_ns()` function can be used. There are two cases to
consider:

(1) retrieving the `PidNamespace` of the `current` task (2) retrieving
the `PidNamespace` of a non-`current` task

From system call context retrieving the `PidNamespace` for case (1) is
always safe and requires neither RCU locking nor a reference count to be
held. Retrieving the `PidNamespace` after `release_task()` for current
will return `NULL` but no codepath like that is exposed to Rust.

Retrieving the `PidNamespace` from system call context for (2) requires
RCU protection. Accessing `PidNamespace` outside of RCU protection
requires a reference count that must've been acquired while holding the
RCU lock. Note that accessing a non-`current` task means `NULL` can be
returned as the non-`current` task could have already passed through
`release_task()`.

To retrieve (1) the `current_pid_ns!()` macro should be used which
ensure that the returned `PidNamespace` cannot outlive the calling
scope. The associated `current_pid_ns()` function should not be called
directly as it could be abused to created an unbounded lifetime for
`PidNamespace`. The `current_pid_ns!()` macro allows Rust to handle the
common case of accessing `current`'s `PidNamespace` without RCU
protection and without having to acquire a reference count.

For (2) the `task_get_pid_ns()` method must be used. This will always
acquire a reference on `PidNamespace` and will return an `Option` to
force the caller to explicitly handle the case where `PidNamespace` is
`None`, something that tends to be forgotten when doing the equivalent
operation in `C`. Missing RCU primitives make it difficult to perform
operations that are otherwise safe without holding a reference count as
long as RCU protection is guaranteed. But it is not important currently.
But we do want it in the future.

Note for (2) the required RCU protection around calling
`task_active_pid_ns()` synchronizes against putting the last reference
of the associated `struct pid` of `task->thread_pid`. The `struct pid`
stored in that field is used to retrieve the `PidNamespace` of the
caller. When `release_task()` is called `task->thread_pid` will be
`NULL`ed and `put_pid()` on said `struct pid` will be delayed in
`free_pid()` via `call_rcu()` allowing everyone with an RCU protected
access to the `struct pid` acquired from `task->thread_pid` to finish.

Signed-off-by: Christian Brauner <brauner@kernel.org>

Overall looks good! A few comments below.

Reviewed-by: Alice Ryhl <aliceryhl@google.com>

quoted

+            task: unsafe { &*PidNamespace::from_ptr(pidns) },

I think you can simplify this to:
task: unsafe { PidNamespace::from_ptr(pidns) },

Done.

quoted

+    /// Returns the given task's pid in the provided pid namespace.
+    #[doc(alias = "task_tgid_nr_ns")]
+    pub fn tgid_nr_ns(&self, pidns: Option<&PidNamespace>) -> Pid {
+        match pidns {
+            // SAFETY: By the type invariant, we know that `self.0` is valid. We received a valid
+            // PidNamespace that we can use as a pointer.
+            Some(pidns) => unsafe { bindings::task_tgid_nr_ns(self.0.get(), pidns.as_ptr()) },
+            // SAFETY: By the type invariant, we know that `self.0` is valid. We received an empty
+            // PidNamespace and thus pass a null pointer. The underlying C function is safe to be
+            // used with NULL pointers.
+            None => unsafe { bindings::task_tgid_nr_ns(self.0.get(), ptr::null_mut()) },

The compiler generates better code if you do this:

let pidns = match pidns {
    Some(pidns) => pidns.as_ptr(),
    None => core::ptr::null_mut(),
};
unsafe { bindings::task_tgid_nr_ns(self.0.get(), pidns) };

Here it should be able to compile the entire match statement down to a
no-op since None is represented as a null pointer.

Ah, great. Thanks and done!

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help