Re: [PATCH RFT v8 4/9] fork: Add shadow stack support to clone3()

From: Catalin Marinas <catalin.marinas@arm.com>
Date: 2024-08-13 16:25:55
Also in: linux-kselftest, lkml

On Sat, Aug 10, 2024 at 12:06:12AM +0100, Mark Brown wrote:

On Fri, Aug 09, 2024 at 07:19:26PM +0100, Catalin Marinas wrote:

quoted

On Thu, Aug 08, 2024 at 09:15:25AM +0100, Mark Brown wrote:

quoted

+	/* This should really be an atomic cmpxchg.  It is not. */
+	if (access_remote_vm(mm, addr, &val, sizeof(val),
+			     FOLL_FORCE) != sizeof(val))
+		goto out;

quoted

If we restrict the shadow stack creation only to the CLONE_VM case, we'd
not need the remote vm access, it's in the current mm context already.
More on this below.

The discussion in previous iterations was that it seemed better to allow
even surprising use cases since it simplifies the analysis of what we
have covered.  If the user has specified a shadow stack we just do what
they asked for and let them worry about if it's useful.

Thanks for the summary of the past discussions, the patch makes more
sense now. I guess it's easier to follow a clone*() syscall where one
can set a new stack pointer even in the !CLONE_VM case. Just let it set
the shadow stack as well with the new ABI.

However, the x86 would be slightly inconsistent here between clone() and
clone3(). I guess it depends how you look at it. The classic clone()
syscall, if one doesn't pass CLONE_VM but does set new stack, there's no
new shadow stack allocated which I'd expect since it's a new stack.
Well, I doubt anyone cares about this scenario. Are there real cases of
!CLONE_VM but with a new stack?

quoted

+	if (val != expected)
+		goto out;

quoted

I'm confused that we need to consume the token here. I could not find
the default shadow stack allocation doing this, only setting it via
create_rstor_token() (or I did not search enough). In the default case,
is the user consuming it? To me the only difference should been the
default allocation vs the one passed by the user via clone3(), with the
latter maybe requiring the user to set the token initially.

As discussed for a couple of previous versions if we don't have the
token and userspace can specify any old shadow stack page as the shadow
stack this allows clone3() to be used to overwrite the shadow stack of
another thread, you can point to a shadow stack page which is currently
in use and then run some code that causes shadow stack writes.  This
could potentially then in turn be used as part of a bigger exploit
chain, probably it's hard to get anything beyond just causing the other
thread to fault but won't be impossible.

With a kernel allocated shadow stack this is not an issue since we are
placing the shadow stack in new memory, userspace can't control where we
place it so it can't overwrite an existing shadow stack.

IIUC, the kernel-allocated shadow stack will have the token always set
while the user-allocated one will be cleared. I was looking to
understand the inconsistency between these two cases in terms of the
final layout of the new shadow stack: one with the token, the other
without. I can see the need for checking but maybe start with requiring
it to be 0 and setting the token before returning, for consistency with
clone().

In the kernel-allocated shadow stack, is the token used for anything? I
can see it's used for signal delivery and return but I couldn't figure
out what it is used for in a thread's shadow stack.

Also, can one not use the clone3() to point to the clone()-allocated
shadow stack? Maybe that's unlikely as an app tends to stick to one
syscall flavour or the other.

quoted

+		/*
+		 * For CLONE_VFORK the child will share the parents
+		 * shadow stack.  Make sure to clear the internal
+		 * tracking of the thread shadow stack so the freeing
+		 * logic run for child knows to leave it alone.
+		 */
+		if (clone_flags & CLONE_VFORK) {
+			shstk->base = 0;
+			shstk->size = 0;
+			return 0;
+		}

quoted

I think we should leave the CLONE_VFORK check on its own independent of
the clone3() arguments. If one passes both CLONE_VFORK and specific
shadow stack address/size, they should be ignored (or maybe return an
error if you want to make it stricter).

This is existing logic from the current x86 code that's been reindented
due to the addition of explicitly specified shadow stacks, it's not new
behaviour.  It is needed to stop the child thinking it has the parent's
shadow stack in the CLONE_VFORK case.

I figured that. But similar to the current !CLONE_VM behaviour where no
new shadow stack is allocated even if a new stack is passed to clone(),
I was thinking of something similar here for consistency: don't set up a
shadow stack in the CLONE_VFORK case or at least allow it only if a new
stack is being set up (if we extend this to clone(), it would be a small
ABI change).

quoted

-	/*
-	 * For !CLONE_VM the child will use a copy of the parents shadow
-	 * stack.
-	 */
-	if (!(clone_flags & CLONE_VM))
-		return 0;
+		/*
+		 * For !CLONE_VM the child will use a copy of the
+		 * parents shadow stack.
+		 */
+		if (!(clone_flags & CLONE_VM))
+			return 0;

quoted

Is the !CLONE_VM case specific only to the default shadow stack
allocation? Sorry if this has been discussed already (or I completely
forgot) but I thought we'd only implement this for the thread creation
case. The typical fork() for a new process should inherit the parent's
layout, so applicable to the clone3() with the shadow stack arguments as
well (which should be ignored or maybe return an error with !CLONE_VM).

This is again all existing behaviour for the case where the user has not
specified a shadow stack reindented, as mentioned above if the user has
specified one explicitly then we just do what we were asked.  The
existing behaviour is to only create a new shadow stack for the child in
the CLONE_VM case and leave the child using the same shadow stack as the
parent in the copied mm for !CLONE_VM.

I guess I was rather questioning the current choices than the new
clone3() ABI. But even for the new clone3() ABI, does it make sense to
set up a shadow stack if the current stack isn't changed? We'll end up
with a lot of possible combinations that will never get tested but
potentially become obscure ABI. Limiting the options to the sane choices
only helps with validation and unsurprising changes later on.

quoted

@@ -2790,6 +2808,8 @@ pid_t kernel_clone(struct kernel_clone_args *args)
 	 */
 	trace_sched_process_fork(current, p);
 
+	shstk_post_fork(p, args);

quoted

Do we need this post fork call? Can we not handle the setup via the
copy_thread() path in shstk_alloc_thread_stack()?

It looks like we do actually have the new mm in the process before we
call copy_thread() so we could move things into there though we'd loose
a small bit of factoring out of the error handling (at one point I had
more code factored out but right now it's quite small, looking again we
could also factor out the get_task_mm()/mput()).  ISTR having the new
process' mm was the biggest reason for this initially but looking again
I'm not sure why that was.  It does still feel like even the small
amount that's factored out currently is useful though, a bit less
duplication in the architecture code which feels welcome here.

I think you can probably keep this. My comment was based on the
assumption that we only support the CLONE_VM case where we wouldn't need
the access_remote_vm(), just some direct write similar to
write_user_shstk_64().

I still think we should have limited this ABI to the CLONE_VM and
!CLONE_VFORK cases but I don't have a strong view if the consensus was
to allow it for classic fork() and vfork() like uses (I just think they
won't be used).

-- 
Catalin

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help