Re: [PATCH RFT v3 0/5] fork: Support shadow stacks in clone3()

From: Christian Brauner <brauner@kernel.org>
Date: 2023-11-23 10:10:40
Also in: linux-kselftest, lkml

On Tue, Nov 21, 2023 at 04:09:40PM +0000, Mark Brown wrote:

On Tue, Nov 21, 2023 at 12:21:37PM +0000, Szabolcs Nagy wrote:

quoted

The 11/21/2023 11:17, Christian Brauner wrote:

quoted

I have a few questions that are probably me just not knowing much about
shadow stacks so hopefully I'm not asking you write a thesis by
accident:

One thing it feels like it's worth saying up front here is that shadow
stacks are userspace memory with special permissions and instructions
for access - they are mapped into the userspace address range and
userspace can directly interact with them in restricted ways.  For
example there's some thought to using shadow stacks in unwinders since
all the return addresses are stored in a single convenient block of
memory which it's much harder to corrupt.  Overflowing a shadow stack
results in userspace getting a memory access fault just as with other
memory access issues.

Thanks for that summary.

quoted

(2) With what other interfaces is implicit allocation and deallocation
    not consistent? I don't understand this argument. The kernel creates
    a shadow stack as a security measure to store return addresses. It
    seems to me exactly that the kernel should implicitly allocate and
    deallocate the shadow stack and not have userspace muck around with
    its size?

quoted

the kernel is not supposed to impose stack size policy or a particular
programming model that limits the stack management options nor prevent
the handling of stack overflows.

The inconsistency here is with the management of the standard stack -
with the standard stack userspace passes an already allocated address
range to the kernel.  A constant tension during review of the shadow
stack interfaces has been that shadow stack memory is userspace memory
but the security constraints mean that we've come down on the side of
having a custom allocation syscall for it instead of using flags on
mmap() and friends like people often expect, and now having it allocated
as part of clone3().  The aim is to highlight that this difference is

So you have two interfaces for allocating a shadow stack. The first one
is to explicitly alloc a shadow stack via the map_shadow_stack(). The
second one is an implicit allocation during clone3() and you want to
allow explicitly influencing that.

deliberately chosen for specific reasons rather than just carelessness.

quoted

(3) Why is it safe for userspace to request the shadow stack size? What
    if they request a tiny shadow stack size? Should this interface
    require any privilege?

quoted

user can allocate huge or tiny stacks already.

quoted

and i think userspace can take control over shadow stack management:
it can disable signals, start a clone child with stack_size == 1 page,
map_shadow_stack and switch to it, enable signals. however this is
complicated, leaks 1 page of kernel allocated shadow stack (+reserved
guard page, i guess userspace could unmap, not sure if that works
currently) and requires additional syscalls.

The other thing here is that if userspace gets this wrong it'll result
in the userspace process hitting the top of the stack and getting fatal
signals in a similar manner to what happens if it gets the size of
the standard stack wrong (the kernel allocation does mean that there
should always be guard pages and it's harder to overrun the stack and
corrupt adjacent memory).  There doesn't seem to be any meaningful risk
here over what userspace can already do to itself anyway as part of
thread allocation.

clone3() _aimed_ to cleanup the stack handling a bit but we had concerns
that deviating too much from legacy clone() would mean userspace
couldn't fully replace it. So we would have liked to clean up stack
handling a lot more but there's limits to that. We do however perform
basic sanity checks now.

quoted

(4) Why isn't the @stack_size argument I added for clone3() enough?
    If it is specified can't the size of the shadow stack derived from it?

quoted

shadow stack only contains return addresses so it is proportional
to the number of stack frames, not the stack size and it must
account for sigaltstack too, not just the thread stack.

quoted

if you make minimal assumptions about stack usage and ignore the
sigaltstack issue then the worst case shadow stack requirement
is indeed proportional to the stack_size, but this upper bound
can be pessimistic and userspace knows the tradeoffs better.

It's also worth pointing out here that the existing shadow stack support
for x86 and in review code for arm64 make exactly these assumptions and
guesses at a shadow stack size based on the stack_size for the thread.

Ok.

There's just been a general lack of enthusiasm for the fact that due to
the need to store variables on the normal stack the resulting shadow
stack is very likely to be substantially overallocated but we can't
safely reduce the size without information from userspace.

Ok.

quoted

And my current main objection is that shadow stacks were just released
to userspace. There can't be a massive amount of users yet - outside of
maybe early adopters.

quoted

no upstream libc has code to enable shadow stacks at this point
so there are exactly 0 users in the open. (this feature requires
runtime support)

quoted

the change is expected to allow wider deployability. (e.g. not
just in glibc)

Right, and the lack of any userspace control of the shadow stack size
has been a review concern with the arm64 GCS series which I'm trying to
address here.  The main concern is that userspaces that start a lot of
threads are going to start using a lot more address space than they need
to when shadow stacks are enabled.  Given the fairly long deployment
pipeline from extending a syscall to end users who might be using the
feature in conjuction with imposing resource limits it does seem like a
reasonable problem to anticipate.

Ok, I can see that argument.

quoted

The fact that there are other architectures that bring in a similar
feature makes me even more hesitant. If they have all agreed _and_
implemented shadow stacks and have unified semantics then we can
consider exposing control knobs to userspace that aren't implicitly
architecture specific currently.

To be clear the reason I'm working on this is that I've implemented the
arm64 support, I don't even have any access to x86 systems that have the

Yes, I'm aware.

feature (hence the RFT in the subject line) - Rick Edgecombe did the x86
work.  The arm64 code is still in review, the userspace interface is
very similar to that for x86 and there doesn't seem to be any
controversy there which makes me expect that a change is likely.  Unlike
x86 we only have a spec and virtual implementations at present, there's
no immintent hardware, so we're taking our time with making sure that
everything is working well.  Deepak Gupta (in CC) has been reviewing the
series from the point of view of RISC-V.  I think we're all fairly well
aligned on the requirements here.

I'm still not enthusiastic that we only have one implementation for this
in the kernel. What's the harm in waiting until the arm patches are
merged? This shouldn't result in chicken and egg: if the implementations
are sufficiently similar then we can do an appropriate clone3()
extension.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help