Re: [PATCH v7 5/9] pidfs, coredump: add PIDFD_INFO_COREDUMP
From: Christian Brauner <brauner@kernel.org>
Date: 2025-05-16 10:34:20
Also in:
linux-fsdevel, linux-security-module, lkml
On Thu, May 15, 2025 at 10:56:26PM +0200, Jann Horn wrote:
On Thu, May 15, 2025 at 12:04 AM Christian Brauner [off-list ref] wrote:quoted
Extend the PIDFD_INFO_COREDUMP ioctl() with the new PIDFD_INFO_COREDUMP mask flag. This adds the fields @coredump_mask and @coredump_cookie to struct pidfd_info.FWIW, now that you're using path-based sockets and override_creds(), one option may be to drop this patch and say "if you don't want untrusted processes to directly connect to the coredumping socket, just set the listening socket to mode 0000 or mode 0600"...quoted
Signed-off-by: Christian Brauner <brauner@kernel.org>[...]quoted
diff --git a/fs/coredump.c b/fs/coredump.c index e1256ebb89c1..bfc4a32f737c 100644 --- a/fs/coredump.c +++ b/fs/coredump.c[...]quoted
@@ -876,8 +880,34 @@ void do_coredump(const kernel_siginfo_t *siginfo) goto close_fail; } + /* + * Set the thread-group leader pid which is used for the + * peer credentials during connect() below. Then + * immediately register it in pidfs... + */ + cprm.pid = task_tgid(current); + retval = pidfs_register_pid(cprm.pid); + if (retval) { + sock_release(socket); + goto close_fail; + } + + /* + * ... and set the coredump information so userspace + * has it available after connect()... + */ + pidfs_coredump(&cprm); + + /* + * ... On connect() the peer credentials are recorded + * and @cprm.pid registered in pidfs...I don't understand this comment. Wasn't "@cprm.pid registered in pidfs" above with the explicit `pidfs_register_pid(cprm.pid)`?
I'll answer both questions in one go below...
quoted
+ */ retval = kernel_connect(socket, (struct sockaddr *)(&addr), addr_len, O_NONBLOCK | SOCK_COREDUMP); + + /* ... So we can safely put our pidfs reference now... */ + pidfs_put_pid(cprm.pid);Why can we safely put the pidfs reference now but couldn't do it before the kernel_connect()? Does the kernel_connect() look up this pidfs entry by calling something like pidfs_alloc_file()? Or does that only happen later on, when the peer does getsockopt(SO_PEERPIDFD)?
AF_UNIX sockets support SO_PEERPIDFD as you know. Users such as dbus or systemd want to be able to retrieve a pidfd for the peer even if the peer has already been reaped. To support this AF_UNIX ensures that when the peer credentials are set up (connect(), listen()) the corresponding @pid will also be registered in pidfs. This ensures that exit information is stored in the inode if we hand out a pidfd for a reaped task. IOW, we only hand out pidfds for reaped task if at the time of reaping a pidfs entry existed for it. Since we're setting coredump information on the pidfd here we're calling pidfs_register_pid() even before connect() sets up the peer credentials so we're sure that the coredump information is stored in the inode. Then we delay our pidfs_put_pid() call until the connect() took it's own reference and thus continues pinning the inode. IOW, connect() will also call pidfs_register_pid() but it will ofc just increment the reference count ensuring that our pidfs_put_pid() doesn't drop the inode. If we immediately did a pidfs_put_pid() before connect() we'd loose the coredump information.
quoted
if (retval) { if (retval == -EAGAIN) coredump_report_failure("Coredump socket %s receive queue full", addr.sun_path);[...]quoted
diff --git a/fs/pidfs.c b/fs/pidfs.c index 3b39e471840b..d7b9a0dd2db6 100644 --- a/fs/pidfs.c +++ b/fs/pidfs.c[...]quoted
@@ -280,6 +299,13 @@ static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg) } } + if (mask & PIDFD_INFO_COREDUMP) { + kinfo.mask |= PIDFD_INFO_COREDUMP; + smp_rmb();I assume I would regret it if I asked what these barriers are for, because the answer is something terrifying about how we otherwise don't have a guarantee that memory accesses can't be reordered between multiple subsequent syscalls or something like that?
No, not really. It's just so that when someone calls PIDFD_GET_INFO with PIDFD_INFO_COREDUMP but one gotten from the coredump socket that they don't see half-initialized information. I can just use WRITE_ONCE() for that.
checkpatch complains about the lack of comments on these memory barriers.
I'll just use WRITE_ONCE().
quoted
+ kinfo.coredump_cookie = READ_ONCE(pidfs_i(inode)->__pei.coredump_cookie); + kinfo.coredump_mask = READ_ONCE(pidfs_i(inode)->__pei.coredump_mask); + } + task = get_pid_task(pid, PIDTYPE_PID); if (!task) { /*[...]quoted
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index a9d1c9ba2961..053d2e48e918 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c[...]quoted
@@ -742,6 +743,7 @@ static void unix_release_sock(struct sock *sk, int embrion) struct unix_peercred { struct pid *peer_pid; + u64 cookie;Maybe add a comment here documenting that for now, this is assumed to be used exclusively for coredump sockets.
I think we should just drop it.