Re: [PATCH v2 4/5] memfd: replace ratcheting feature from vm.memfd_noexec... | linux-api

Re: [PATCH v2 4/5] memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy

From: Jeff Xu <hidden>
Date: 2023-08-16 22:48:06
Also in: linux-kselftest, linux-mm, lkml, stable

On Tue, Aug 15, 2023 at 10:44 PM Dominique Martinet
[off-list ref] wrote:

Jeff Xu wrote on Tue, Aug 15, 2023 at 10:13:18PM -0700:

quoted

Given that it is possible for CAP_SYS_ADMIN users to create executable
binaries without memfd_create(2) and without touching the host
filesystem (not to mention the many other things a CAP_SYS_ADMIN process
would be able to do that would be equivalent or worse), it seems strange
to cause a fair amount of headache to admins when there doesn't appear
to be an actual security benefit to blocking this. There appear to be
concerns about confused-deputy-esque attacks[2] but a confused deputy that
can write to arbitrary sysctls is a bigger security issue than
executable memfds.

Something to point out: The demo code might be enough to prove your
case in other distributions, however, in ChromeOS, you can't run this
code. The executable in ChromeOS are all from known sources and
verified at boot.
If an attacker could run this code in ChromeOS, that means the
attacker already acquired arbitrary code execution through other ways,
at that point, the attacker no longer needs to create/find an
executable memfd, they already have the vehicle. You can't use an
example of an attacker already running arbitrary code to prove that
disable downgrading is useless.
I agree it is a big problem that an attacker already can modify a
sysctl.  Assuming this can happen by controlling arguments passed into
sysctl, at the time, the attacker might not have full arbitrary code
execution yet, that is the reason the original design is so
restrictive.

I don't understand how you can say an attacker cannot run arbitrary code
within a process here, yet assert that they'd somehow run memfd_create +
execveat on it if this sysctl is lowered -- the two look equivalent to
me?

It might require multiple steps for this attack, one possible scenario:
1> control a write primitive in CAP_SYSADMIN process's memory,  change
arguments of sysctl call, and downgrade the setting for memfd, e.g. change
it=0 to revert to old behavior (by default creating executable memfd)
2> control a non-privileged process that creates and writes to
memfd, and write the contents with the binary that the
attacker wants. This process just needs non-executable memfd, but
isn't updated yet.
3> Confuse a non-privilege process to execute the memfd the attacker
wrote in step 2.

In chromeOS, because all the executables are from verified sources,
attackers typically can't easily use the step 3 alone (without step
2),  and memfd was such a hole that enables an unverified executable.

In the original design, downgrading is not allowed, the attack chain
of 2/3 is completely blocked.  With this new approach, attackers will
try to find an additional step (step 1) to make the old attack (step 2
and 3) working again. It is difficult but I can't say it is
impossible.

CAP_SYS_ADMIN is a kludge of a capability that pretty much gives root as
soon as you can run arbitrary code (just have a look at the various
container escape example when the capability is given); I see little
point in trying to harden just this here.

I'm not an expert in containers, if the industry is giving up on
privileged containers, then the reasoning makes sense.
From ChromeOS point of view, we don't use runc currently, so I think
it makes more sense for runc users to drive these features.  The
original design is with runc's in mind, and even privileged containers
can't downgrade its own setting.

It'd make more sense to limit all sysctl modifications in the context
you're thinking of through e.g. selinux or another LSM.

I agree,  when I think more about this.
Security features fit LSM better, LSM can do additional "allow/deny"
on otherwise allowed behavior from user space code. Based on that,
"disallow downgrading" fits LSM better.  Also from the same reasoning,
I have second thoughts on the "=2", originally the "MEMFD_EXE was left
out due to the thinking, if user code explicitly setting MEMFD_EXE,
sysctl should not block it, it is the work of LSM. However, the "=2"
has evolved to block MEMFD_EXE completely ... alas .. it might be too
late to revert this, if this is what devs want, it can be that way.

Thanks
Best regards,
-Jeff




-Jeff

(in the context of users making their own containers, my suggestion is
always to never use CAP_SYS_ADMIN, or if they must give it to a separate
minimal container where they can limit user interaction)


FWIW, I also think the proposed =2 behaviour makes more sense, but this
is something we already discussed last month so I won't come back to it
as not really involved here.

--
Dominique Martinet | Asmadeus

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help