Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE

From: Mickaël Salaün <mic@digikod.net>
Date: 2025-08-25 09:31:52
Also in: linux-api, linux-fsdevel, linux-integrity, lkml

On Sun, Aug 24, 2025 at 11:04:03AM -0700, Andy Lutomirski wrote:

On Sun, Aug 24, 2025 at 4:03 AM Mickaël Salaün [off-list ref] wrote:

On Fri, Aug 22, 2025 at 09:45:32PM +0200, Jann Horn wrote:

On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün [off-list ref] wrote:

Add a new O_DENY_WRITE flag usable at open time and on already-opened
files (e.g. passed file descriptors).  This changes the state of the
opened file by making it read-only until it is closed.  The main use
case is for script interpreters to get the guarantee that a script's
content cannot be altered while being read and interpreted.  This is
useful for generic distros that may not have a write-xor-execute policy.
See commit a5874fde3c08 ("exec: Add a new AT_EXECVE_CHECK flag to
execveat(2)").

Both execve(2) and the IOCTL to enable fsverity can already set this
property on files with deny_write_access().  This new O_DENY_WRITE makes

The kernel actually tried to get rid of this behavior on execve() in
commit 2a010c41285345da60cece35575b4e0af7e7bf44; but sadly that had
to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d
because it broke userspace assumptions.

Oh, good to know.
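
For reference, the execve() deny-write behavior mentioned above can be
observed from userspace.  The following is a minimal sketch, not part of
the patch: the helper name probe_running_exe_write() is made up for
illustration, and the result depends on the running kernel (kernels
carrying the reverted change would let the open succeed) and on file
permissions.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to open the currently running executable for
 * writing.  Returns 0 if the open succeeded, otherwise the errno value.
 * On kernels that keep the execve() deny-write behavior, a file the
 * caller could otherwise write is expected to fail with ETXTBSY.
 * Nothing is written or truncated; the fd is closed immediately.
 */
int probe_running_exe_write(void)
{
    int fd = open("/proc/self/exe", O_WRONLY | O_CLOEXEC);

    if (fd == -1)
        return errno;
    close(fd);
    return 0;
}
```

A caller would compare the return value against ETXTBSY; EACCES or
EROFS only mean the file was not writable in the first place.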

it widely available.  This is similar to what other OSs may provide,
e.g. opening a file with only FILE_SHARE_READ on Windows.

We used to have the analogous mmap() flag MAP_DENYWRITE, and that was
removed for security reasons; as
https://man7.org/linux/man-pages/man2/mmap.2.html says:

|        MAP_DENYWRITE
|               This flag is ignored.  (Long ago—Linux 2.0 and earlier—it
|               signaled that attempts to write to the underlying file
|               should fail with ETXTBSY.  But this was a source of denial-
|               of-service attacks.)

It seems to me that the same issue applies to your patch - it would
allow unprivileged processes to essentially lock files such that other
processes can't write to them anymore. This might allow unprivileged
users to prevent root from updating config files or stuff like that if
they're updated in-place.

Yes, I agree, but since it is the case for executed files I thought it
was worth starting a discussion on this topic.  This new flag could be
restricted to executable files, but we should avoid system-wide locks
like this.  I'm not sure how Windows handles these issues though.

Anyway, we should rely on the access control policy to control write and
execute access in a consistent way (e.g. write-xor-execute).  Thanks for
the references and the background!

I'm confused.  I understand that there are many contexts in which one
would want to prevent execution of unapproved content, which might
include preventing a given process from modifying some code and then
executing it.

I don't understand what these deny-write features have to do with it.
These features merely prevent someone from modifying code *that is
currently in use*, which is not at all the same thing as preventing
modifying code that might get executed -- one can often modify
contents *before* executing those contents.

The order of checks would be:
1. open script with O_DENY_WRITE
2. check executability with AT_EXECVE_CHECK
3. read the content and interpret it

The deny-write feature was to guarantee that there is no race condition
between steps 2 and 3.  All these checks are supposed to be done by a
trusted interpreter (which is allowed to be executed).  The
AT_EXECVE_CHECK call enables the caller to know if the kernel (and
associated security policies) allowed the *current* content of the file
to be executed.  Whatever happens before or after that (wrt.
O_DENY_WRITE) should be covered by the security policy.
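
Step 2 of the sequence above could look roughly like the sketch below.
Assumptions: the helper name script_fd_exec_allowed() is hypothetical,
the fallback value 0x10000 for AT_EXECVE_CHECK is my reading of the UAPI
headers for kernels that have the flag, and older kernels are expected
to reject the flag (reported here as "unknown").  Step 1 is left out
because O_DENY_WRITE only exists in this RFC.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef AT_EXECVE_CHECK
#define AT_EXECVE_CHECK 0x10000 /* assumed value, for older libc headers */
#endif

/*
 * Hypothetical helper: ask the kernel whether the file behind fd would
 * be allowed to execute, without executing it.
 * Returns 1 if allowed, 0 if denied with EACCES, and -1 if the answer
 * is unknown (e.g. the kernel does not support AT_EXECVE_CHECK).
 */
int script_fd_exec_allowed(int fd)
{
    char *const argv[] = { "script", NULL };
    char *const envp[] = { NULL };

    if (syscall(SYS_execveat, fd, "", argv, envp,
                AT_EMPTY_PATH | AT_EXECVE_CHECK) == 0)
        return 1;
    return errno == EACCES ? 0 : -1;
}
```

An interpreter would call this on the fd it already opened for the
script and only then read and interpret the content from that same fd.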

In any case, IMO it's rather sad that the elimination of ETXTBSY had
to be reverted -- it's really quite a nasty feature.  But it occurs to
me that Linux can more or less do what is IMO the actually desired
thing: snapshot the contents of a file and execute the snapshot.  The
hack at the end of the email works!  (Well, it works if the chosen
filesystem supports it.)

$ ./silly_tmp /tmp/test /tmp vim /proc/self/fd/3

emacs is apparently far, far too clever and can't save if you do:

$ ./silly_tmp /tmp/test /tmp emacs /proc/self/fd/3


I'm not seriously suggesting that anyone should execute binaries or
scripts on Linux exactly like this, for a whole bunch of reasons:

- It needs filesystem support (but maybe this isn't so bad)

- It needs write access to a directory on the correct filesystem (a
showstopper for serious use)

- It is wildly incompatible with write-xor-execute, so this would be a
case of one step forward, ten steps back.

- It would defeat a lot of tools that inspect /proc, which would be
quite annoying to say the least.


But maybe a less kludgy version could be used for real.  What if there
was a syscall that would take an fd and make a snapshot of the file?

Yes, that would be a clean solution.  I don't think this is achievable
in an efficient way without involving filesystem implementations though.

It would, at least by default, produce a *read-only* snapshot (fully
sealed a la F_SEAL_*), inherit any integrity data that came with the
source (e.g. LSMs could understand it), would not require a writable
directory on the filesystem, and would maybe even come with an extra
seal-like thing that prevents it from being linkat-ed.  (I'm not sure
that linkat would actually be a problem, but I'm also not immediately
sure that LSMs would be as comfortable with it if linkat were
allowed.)  And there could probably be an extremely efficient
implementation that might even reuse the existing deny-write mechanism
to optimize the common case where the file is never written.
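
To illustrate what "fully sealed a la F_SEAL_*" means with today's
interfaces, here is a rough userspace approximation, assuming a memfd is
an acceptable stand-in.  It copies the data (so it is not efficient),
carries no integrity data from the source, and is not the proposed
syscall.  The name sealed_copy() is hypothetical.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy the contents of src_fd into a new memfd and
 * seal it against writes, resizing and further seal changes.
 * Returns the sealed fd, or -1 on error.
 */
int sealed_copy(int src_fd)
{
    char buf[65536];
    ssize_t n;
    off_t off = 0;
    int fd = memfd_create("snapshot", MFD_ALLOW_SEALING);

    if (fd == -1)
        return -1;

    while ((n = pread(src_fd, buf, sizeof(buf), off)) > 0) {
        ssize_t done = 0;

        while (done < n) {
            ssize_t w = write(fd, buf + done, (size_t)(n - done));

            if (w <= 0)
                goto fail;
            done += w;
        }
        off += n;
    }
    if (n < 0)
        goto fail;

    /* After this, write(), ftruncate() and F_ADD_SEALS on fd all fail. */
    if (fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE | F_SEAL_GROW |
                               F_SEAL_SHRINK | F_SEAL_SEAL) == -1)
        goto fail;
    return fd;

fail:
    close(fd);
    return -1;
}
```

Later changes to the source file are not reflected in the returned fd,
and the copy itself can no longer be modified through any fd.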

For that matter, the actual common case would be to execute stuff in
/usr or similar, and those files really ought never to be modified.
So there could be a file attribute or something that means "this file
CANNOT be modified, but it can still be unlinked or replaced as
usual", and snapshotting such a file would be a no-op.  Distributions
and container tools could set that attribute.  Overlayfs could also
provide an efficient implementation if the file currently comes from
an immutable source.

Hmm, maybe it's not strictly necessary that it be immutable -- maybe
it's sometimes okay if reads start to fail if the contents change.
Let's call this a "weak snapshot" -- reads of a weak snapshot either
return the original contents or fail.  fsverity would give weak
snapshots at no additional cost.


It's worth noting that the common case doesn't actually need an fd.
We have mmap(..., MAP_PRIVATE, ...).  What we would actually want for
mmap use cases is mmap(..., MAP_SNAPSHOT, ...), with the semantics
that the kernel promises that future writes to the source would either
not be reflected in the mapping or would cause SIGBUS.  One might
reasonably debate what forced-writes would do (I think forced-writes
should be allowed just like they currently are, since anyone who can
force-write to process memory is already assumed to be permitted to
bypass write-xor-execute).
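
The gap between MAP_PRIVATE and the suggested MAP_SNAPSHOT can be shown
with a small experiment.  This is only a sketch with a hypothetical
helper name, private_mapping_sees().  mmap(2) leaves it unspecified
whether file changes made after the mmap() call show up in a private
mapping; on Linux, a page that has not been privately written typically
still reflects later writes to the file.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map a one-page temp file with MAP_PRIVATE,
 * optionally write to the page first (which forces a private copy),
 * then overwrite the file's first byte through the fd.
 * Returns the first byte seen through the mapping, or -1 on error.
 */
int private_mapping_sees(int cow_first)
{
    char path[] = "/tmp/map_private_XXXXXX";
    long page = sysconf(_SC_PAGESIZE);
    int fd = mkstemp(path);
    int seen = -1;

    if (fd == -1)
        return -1;
    unlink(path);

    if (ftruncate(fd, page) == 0 && pwrite(fd, "A", 1, 0) == 1) {
        char *map = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE, fd, 0);

        if (map != MAP_FAILED) {
            if (cow_first)
                map[1] = 'x'; /* private write: the page gets copied */
            if (pwrite(fd, "B", 1, 0) == 1)
                seen = (unsigned char)map[0];
            munmap(map, (size_t)page);
        }
    }
    close(fd);
    return seen;
}
```

With cow_first set, the mapping keeps showing 'A'.  Without it, the
later file write is usually visible, which is exactly what a
MAP_SNAPSHOT mapping would have to rule out (or turn into SIGBUS).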


---

/* Written by Claude Sonnet 4 with a surprisingly small amount of help
from Andy */

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>

int main(int argc, char *argv[]) {
    if (argc < 4) {
        fprintf(stderr,
                "Usage: %s <source_file> <temp_dir> [exec_args...]\n",
                argv[0]);
        exit(1);
    }

    const char *source_file = argv[1];
    const char *temp_dir = argv[2];

    // Open source file
    int source_fd = open(source_file, O_RDONLY);
    if (source_fd == -1) {
        perror("Failed to open source file");
        exit(1);
    }

    // Create temporary file
    int temp_fd = open(temp_dir, O_TMPFILE | O_RDWR, 0600);
    if (temp_fd == -1) {
        perror("Failed to create temporary file");
        close(source_fd);
        exit(1);
    }

    // Clone the file contents using FICLONE
    if (ioctl(temp_fd, FICLONE, source_fd) == -1) {
        perror("Failed to clone file");
        close(source_fd);
        close(temp_fd);
        exit(1);
    }

    // Close source file
    close(source_fd);

    // Make sure temp file is on fd 3
    if (temp_fd != 3) {
        if (dup2(temp_fd, 3) == -1) {
            perror("Failed to move temp file to fd 3");
            close(temp_fd);
            exit(1);
        }
        close(temp_fd);
    }

    // Execute the remaining arguments
    if (argc >= 4) {
        execvp(argv[3], &argv[3]);
        perror("Failed to execute command");
        exit(1);
    }

    return 0;
}

As you said, this doesn't work if temp_dir is not allowed for execution,
and it doesn't allow the kernel to check/track the content of the
script, which is the purpose of AT_EXECVE_CHECK.
