Re: [PATCH 0/4] pid: add pidctl()

From: Jonathan Kowalski <hidden>
Date: 2019-03-25 19:42:32
Also in: lkml

On Mon, Mar 25, 2019 at 6:57 PM Daniel Colascione [off-list ref] wrote:

On Mon, Mar 25, 2019 at 11:19 AM Jonathan Kowalski [off-list ref] wrote:

quoted

On Mon, Mar 25, 2019 at 5:53 PM Daniel Colascione [off-list ref] wrote:

quoted

[..snip..]

I don't like the idea of having one kind of pollfd be pollable and
another not. Such an interface would be confusing for users. If, as
you suggest below, we instead make the procfs-less FD the only thing
we call a "pidfd", then this proposal becomes less confusing and more
viable.

That's certainly on the table, we remove the ability to open
/proc/<PID> and use the dir fd with pidfd_send_signal. I'm in favor of
this.

quoted

But.. one thing we could do for Daniel usecase is if a /proc/pid directory fd
can be translated into a "pidfd" using another syscall or even a node, like
/proc/pid/handle or something. I think this is what Christian suggested in
the previous threads.

/proc/pid/handle, if I understand correctly, "translates" a
procfs-based pidfd to a non-pidfd one. The needed interface would have
to perform the opposite translation, providing a procfs directory for
a given pidfd.

quoted

And also for the translation the other way, add a syscall or modify
translate_fd or something, to covert a anon_inode pidfd into a /proc/pid
directory fd. Then the user is welcomed to do openat(2) on _that_ directory fd.
Then we modify pidfd_send_signal to only send signals to pure pidfd fds, not
to /proc/pid directory fds.

This approach would work, but there's one subtlety to take into
account: which procfs? You'd want to take, as inputs, 1) the procfs
root you want, and 2) the pidfd, this way you could return to the user
a directory FD in the right procfs.

I don't think metadata access is in the scope of such a pidfd itself.

It should be possible to write a race-free pkill(1) using pidfds.
Without metadata access tied to the pidfd somehow, that can't be done.

quoted

Should we work on patches for these? Please let us know if this idea makes
sense and thanks a lot for adding us to the review as well.

I would strongly prefer that we not merge pidctl (or whatever it is)
without a story for metadata access, be it your suggestion or
something else.

IMO, this is out of scope for a process handle. Hence, the need for
metadata access musn't stall it as is.

On the contrary, rather than metadata being out of scope, metadata
access is essential. Given a file handle (an FD), you can learn about
the file backing that handle by using fstat(2). Given a socket handle
(a socket FD), you can learn about the socket with getsockopt(2) and
ioctl(2). It would be strange not to be able, similarly, to learn
things about a process given a handle (a pidfd) to that process. I
want process handles to be "boring" in that they let users query and
manipulate processes mostly like they manipulate other resources.

Yes, but everything in /proc is not equivalent to an attribute, or an
option, and depending on its configuration, you may not want to allow
processes to even be able to see /proc for any PIDs other than those
running as their own user (hidepid). This means, even if this new
system call is added, to respect hidepid, it must, depending on if
/proc is mounted (and what hidepid is set to, and what gid= is set
to), return EPERM, because then there is a discrepancy between how the
two entrypoints to acquire a process handle do access control. This is
ugly, but ideally should work the way it is described above.
Otherwise, things would be able to bypass the restrictions made
therein, because we also want a system call.

PIDs however are addressable with kill(2) even with hidepid enabled.
For good reason, the capabilities you can exercise using a PID is
limited in that case. The same logic applies to a process reference
like pidfd.

I agree there should be some way (if you can take care of the hidepid
gotcha) to return a dir fd of /proc entry, but I don't think it should
be the pidfd itself. You can get it to work using the fdinfo thing for
now.

quoted

If you really need this, the pid this pidfd is mapped to can be
exposed through fdinfo (which is perfectly fine for your case as you
use /proc anyway). This means you can reliably determine if you're
opening it for the same pid (and it hasn't been recycled) because this
pidfd will be pollable for state change (in the future). Exposing it
through fdinfo isn't a problem, pid ns is bound to the process during
its lifetime, it's a process creation time property, so the value
isn't going to change anyway. So you can do all metadata access you
want subsequently.

Thanks for the proposal. I'm a bit confused. Are you suggesting that
we report the numeric FD in fdinfo? Are you saying it should work
basically like this?

Yes, numeric PID that you would see from your namespace for the said
process the pidfd is for. It could be -1 if this process is not in any
of the namespaces (current or children) (which would happen, say if
you pass it to some other pidns residing process, during fd
installation on ithe target process we set that up properly).

int get_oom_score_adj(int pidfd) {
  unique_fd fdinfo_fd = open(fmt("/proc/self/fdinfo/%d", pidfd));
  string fdinfo = read_all(fdinfo_fd);
  numeric_pid = parse_fdinfo_for_line("pid");
  unique_fd procdir_fd = open(fmt("/proc/%d", numeric_pid), O_DIRECTORY);
  if(!is_pidfd_alive(pidfd)) { return -1; /* process died */ } /*
LIVENESS CHECK */
  // We know the process named by pidfd was called NUMERIC_PID
  // in our namespace (because fdinfo told us) and we know that the
named process
  // was alive after we successfully opened its /proc/pid directory, therefore,
  // PROCDIR_FD and PIDFD must refer to the same underlying process.
  unique_fd oom_adj_score_fd = openat(procdir_fd, "oom_score_adj");
  return parse_int(read_all(oom_adj_score_fd));
}

While this interface functions correctly, I have two concerns: 1) the
number of system calls necessary is very large -- by my count,
starting with the pifd, we need six, not counting the follow-on
close(2) calls (which would bring the total to nine[1]),

But hey, that's a bit rich if you're planning to collect metadata from
/proc ;-).

and 2) it's
"fail unsafe": IMHO, most users in practice will skip the line marked
"LIVENESS CHECK", and as a result, their code will appear to work but
contain subtle race conditions. An explicit interface to translate
from a (PIDFD, PROCFS_ROOT) tuple to a /proc/pid directory file
descriptor would be both more efficient and fail-safe.

[1] as a separate matter, it'd be nice to have a batch version of close(2).

Since /proc is full of gunk, how about adding more to it and making
the magic symlink of /proc/self/fd for the pidfd to lead to the dirfd
of the /proc entry of the process it maps to, when one uses
O_DIRECTORY while opening it? Otherwise, it behaves as it does today.
It would be equivalent to opening the proc entry with usual access
restrictions (and hidepid made to work) but without the races, and
because for processes outside your and children pid ns, it shouldn't
work anyway, and since they wouldn't have their entry on this procfs
instance, it would all just fit in nicely?

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help