Thread: 19 messages, 6 authors, 2020-09-17

Re: [PATCH v5 1/3] open: add close_range()

From: Szabolcs Nagy <hidden>
Date: 2020-06-05 15:01:33
Also in: linux-fsdevel, lkml

* Christian Brauner [off-list ref] [2020-06-02 22:42:17 +0200]:
> This adds the close_range() syscall. It allows to efficiently close a range
> of file descriptors up to all file descriptors of a calling task.
>
> I've also coordinated with some FreeBSD developers who got in touch with
> me (Cced below). FreeBSD intends to add the same syscall once we merged it.
> Quite a bunch of projects in userspace are waiting on this syscall
> including Python and systemd.
>
> The syscall came up in a recent discussion around the new mount API and
> making new file descriptor types cloexec by default. During this
> discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
> syscall in this manner has been requested by various people over time.
>
> First, it helps to close all file descriptors of an exec()ing task. This
> can be done safely via (quoting Al's example from [1] verbatim):
>
>         /* that exec is sensitive */
>         unshare(CLONE_FILES);
>         /* we don't want anything past stderr here */
>         close_range(3, ~0U);
>         execve(....);

this api needs a documentation patch if there isn't one yet.

currently there is no libc interface contract in place that
says which calls may use libc internal fds e.g. i've seen

  openlog(...) // opens libc internal syslog fd
  ...
  fork()
  closefrom(...) // close syslog fd
  open(...) // something that reuses the closed fd
  syslog(...) // unsafe: uses the wrong fd
  execve(...)

syslog uses a libc internal fd that the user trampled on and
this can go bad in many ways depending on what libc apis are
used between closefrom (or equivalent) and exec.

> The code snippet above is one way of working around the problem that file
> descriptors are not cloexec by default. This is aggravated by the fact that
> we can't just switch them over without massively regressing userspace. For

why is a switch_to_cloexec_range worse than close_range?
the former seems safer to me. (and allows libc calls
to be made between such switch and exec: libc internal
fds have to be cloexec anyway)
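
For comparison, a userspace sketch of what marking a range close-on-exec could look like with existing calls. mark_cloexec_range() is a hypothetical name, not a proposed kernel interface, and the per-descriptor fcntl() loop is an assumption made for illustration. Descriptors stay usable until exec, which is the property argued for above.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for a "switch to cloexec" range operation:
 * sets FD_CLOEXEC on every open descriptor in [first, last].
 * Returns the number of descriptors marked, or -1 on an unexpected error.
 */
int mark_cloexec_range(int first, int last)
{
    int marked = 0;

    if (first < 0 || last < first)
        return 0;

    for (int fd = first; ; fd++) {
        int flags = fcntl(fd, F_GETFD);
        if (flags >= 0) {
            /* open descriptor: keep existing flags, add close-on-exec */
            if (fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0)
                return -1;
            marked++;
        } else if (errno != EBADF) {
            return -1;
        }
        /* EBADF means the slot is not open, so there is nothing to mark */
        if (fd == last)
            break;  /* checked here so that last == INT_MAX cannot overflow */
    }
    return marked;
}
```

A caller would pass a bounded upper limit, for example the RLIMIT_NOFILE soft limit minus one, since the loop visits every number in the range.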

> a whole class of programs having an in-kernel method of closing all file
> descriptors is very helpful (e.g. demons, service managers, programming
> language standard libraries, container managers etc.).
> (Please note, unshare(CLONE_FILES) should only be needed if the calling
> task is multi-threaded and shares the file descriptor table with another
> thread in which case two threads could race with one thread allocating file
> descriptors and the other one closing them via close_range(). For the
> general case close_range() before the execve() is sufficient.)
>
> Second, it allows userspace to avoid implementing closing all file
> descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
> file descriptor. From looking at various large(ish) userspace code bases
> this or similar patterns are very common in:
> - service managers (cf. [4])
> - libcs (cf. [6])
> - container runtimes (cf. [5])
> - programming language runtimes/standard libraries
>   - Python (cf. [2])
>   - Rust (cf. [7], [8])
> As Dmitry pointed out there's even a long-standing glibc bug about missing
> kernel support for this task (cf. [3]).
> In addition, the syscall will also work for tasks that do not have procfs
> mounted and on kernels that do not have procfs support compiled in. In such
> situations the only way to make sure that all file descriptors are closed
> is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
> OPEN_MAX trickery (cf. comment [8] on Rust).

close_range still seems like a bad operation to expose.

if users really want closing behaviour (instead of marking
fds cloexec) then they likely need coordination with libc
and other libraries.

e.g. this usage does not work:

  maxfd = findmaxfd();
  call_that_may_leak_fds();
  close_range(maxfd,~0U);

as far as i can tell only the close right before exec works.