Thread (115 messages) 115 messages, 12 authors, 2023-03-01

Re: [PATCH v6 33/41] x86/shstk: Introduce map_shadow_stack syscall

From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Date: 2023-02-23 01:12:00
Also in: linux-arch, linux-doc, linux-mm, lkml

On Wed, 2023-02-22 at 16:03 -0800, Deepak Gupta wrote:
On Sat, Feb 18, 2023 at 01:14:25PM -0800, Rick Edgecombe wrote:
quoted
When operating with shadow stacks enabled, the kernel will
automatically
allocate shadow stacks for new threads, however in some cases
userspace
will need additional shadow stacks. The main example of this is the
ucontext family of functions, which require userspace allocating
and
pivoting to userspace managed stacks.

Unlike most other user memory permissions, shadow stacks need to be
provisioned with special data in order to be useful. They need to
be setup
with a restore token so that userspace can pivot to them via the
RSTORSSP
instruction. But, the security design of shadow stack's is that
they
should not be written to except in limited circumstances. This
presents a
problem for userspace, as to how userspace can provision this
special
data, without allowing for the shadow stack to be generally
writable.

Previously, a new PROT_SHADOW_STACK was attempted, which could be
mprotect()ed from RW permissions after the data was provisioned.
This was
found to not be secure enough, as other thread's could write to the
shadow stack during the writable window.

The kernel can use a special instruction, WRUSS, to write directly
to
userspace shadow stacks. So the solution can be that memory can be
mapped
as shadow stack permissions from the beginning (never generally
writable
in userspace), and the kernel itself can write the restore token.

First, a new madvise() flag was explored, which could operate on
the
PROT_SHADOW_STACK memory. This had a couple downsides:
1. Extra checks were needed in mprotect() to prevent writable
memory from
   ever becoming PROT_SHADOW_STACK.
2. Extra checks/vma state were needed in the new madvise() to
prevent
   restore tokens being written into the middle of pre-used shadow
stacks.
   It is ideal to prevent restore tokens being added at arbitrary
   locations, so the check was to make sure the shadow stack had
never been
   written to.
3. It stood out from the rest of the madvise flags, as more of
direct
   action than a hint at future desired behavior.

So rather than repurpose two existing syscalls (mmap, madvise) that
don't
quite fit, just implement a new map_shadow_stack syscall to allow
userspace to map and setup new shadow stacks in one step. While
ucontext
is the primary motivator, userspace may have other unforeseen
reasons to
setup it's own shadow stacks using the WRSS instruction. Towards
this
provide a flag so that stacks can be optionally setup securely for
the
common case of ucontext without enabling WRSS. Or potentially have
the
kernel set up the shadow stack in some new way.
Was following ever attempted?

void *shstk = mmap(0, size, PROT_SHADOWSTACK, ...);
- limit PROT_SHADOWSTACK protection flag to only mmap (and thus
mprotect can't
   convert memory from shadow stack to non-shadow stack type or vice
versa)
- limit PROT_SHADOWSTACK protection flag to anonymous memory only.
- top level mmap handler to put a token at the base using WRUSS if
prot == PROT_SHADOWSTACK

You essentially would get shadow stack manufacturing with existing
(single) syscall.
Acting a bit selfish here, this allows other architectures as well to
re-use this and 
do their own implementation of mapping and placing the token at the
base.
Yes, I looked at it. You end up with a pile of checks and hooks added
to mmap() and various other places as you outline. We also now have the
MAP_ABOVE4G limitation for x86 shadow stack that would need checking
for too. It's not exactly a clean fit. Then, callers would have to pass
special x86 flags in anyway.

It doesn't seem like the complexity of the checks is worth saving the
tiny syscall. Is there some reason why riscv can't use the same syscall
stub? It doesn't need to live forever in x86 code. Not sure what the
savings are for riscv of the mmap+checks approach are either...

I did wonder if there could be some sort of more general syscall for
mapping and provisioning special security-type memory. But we probably
need a few more non-shadow stack examples to get an idea of what that
would look like.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help