Thread (154 messages) 154 messages, 12 authors, 2023-03-20

Re: [PATCH v7 30/41] x86/shstk: Handle thread shadow stack

From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Date: 2023-03-08 20:03:35
Also in: linux-arch, linux-doc, linux-mm, lkml
Subsystem: the rest, x86 architecture (32-bit and 64-bit) · Maintainers: Linus Torvalds, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen

On Wed, 2023-03-08 at 16:26 +0100, Borislav Petkov wrote:
On Mon, Feb 27, 2023 at 02:29:46PM -0800, Rick Edgecombe wrote:
quoted
From: Yu-cheng Yu <redacted>

When a process is duplicated, but the child shares the address
space with
the parent, there is potential for the threads sharing a single
stack to
cause conflicts for each other. In the normal non-cet case this is
handled
"non-CET"
Sure.
quoted
in two ways.

With regular CLONE_VM a new stack is provided by userspace such
that the
parent and child have different stacks.

For vfork, the parent is suspended until the child exits. So as
long as
the child doesn't return from the vfork()/CLONE_VFORK calling
function and
sticks to a limited set of operations, the parent and child can
share the
same stack.

For shadow stack, these scenarios present similar sharing problems.
For the
CLONE_VM case, the child and the parent must have separate shadow
stacks.
Instead of changing clone to take a shadow stack, have the kernel
just
allocate one and switch to it.

Use stack_size passed from clone3() syscall for thread shadow stack
size. A
compat-mode thread shadow stack size is further reduced to 1/4.
This
allows more threads to run in a 32-bit address space. The clone()
does not
pass stack_size, which was added to clone3(). In that case, use
RLIMIT_STACK size and cap to 4 GB.

For shadow stack enabled vfork(), the parent and child can share
the same
shadow stack, like they can share a normal stack. Since the parent
is
suspended until the child terminates, the child will not interfere
with
the parent while executing as long as it doesn't return from the
vfork()
and overwrite up the shadow stack. The child can safely overwrite
down
the shadow stack, as the parent can just overwrite this later. So
CET does
not add any additional limitations for vfork().

Userspace implementing posix vfork() can actually prevent the child
from
"POSIX"
Ok.
...
quoted
diff --git a/arch/x86/kernel/fpu/core.c
b/arch/x86/kernel/fpu/core.c
index f851558b673f..bc3de4aeb661 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -552,8 +552,41 @@ static inline void fpu_inherit_perms(struct
fpu *dst_fpu)
 	}
 }
 
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+static int update_fpu_shstk(struct task_struct *dst, unsigned long
ssp)
+{
+	struct cet_user_state *xstate;
+
+	/* If ssp update is not needed. */
+	if (!ssp)
+		return 0;
+
+	xstate = get_xsave_addr(&dst->thread.fpu.fpstate->regs.xsave,
+				XFEATURE_CET_USER);
+
+	/*
+	 * If there is a non-zero ssp, then 'dst' must be configured
with a shadow
+	 * stack and the fpu state should be up to date since it was
just copied
+	 * from the parent in fpu_clone(). So there must be a valid
non-init CET
+	 * state location in the buffer.
+	 */
+	if (WARN_ON_ONCE(!xstate))
+		return 1;
+
+	xstate->user_ssp = (u64)ssp;
+
+	return 0;
+}
+#else
+static int update_fpu_shstk(struct task_struct *dst, unsigned long
shstk_addr)
								      ^
^^^^^^^^^^
ssp, like above.

Better yet:

static int update_fpu_shstk(struct task_struct *dst, unsigned long
ssp)
{
#ifdef CONFIG_X86_USER_SHADOW_STACK
	...
#endif
	return 0;
}

and less ifdeffery.
Sure. Sometimes people tell me to only ifdef out whole functions to
make it easier to read. I suppose in this case it's not hard to see.


quoted
+{
+	return 0;
+}
+#endif
+
 /* Clone current's FPU state on fork */
-int fpu_clone(struct task_struct *dst, unsigned long clone_flags,
bool minimal)
+int fpu_clone(struct task_struct *dst, unsigned long clone_flags,
bool minimal,
+	      unsigned long ssp)
 {
 	struct fpu *src_fpu = &current->thread.fpu;
 	struct fpu *dst_fpu = &dst->thread.fpu;
@@ -613,6 +646,12 @@ int fpu_clone(struct task_struct *dst,
unsigned long clone_flags, bool minimal)
 	if (use_xsave())
 		dst_fpu->fpstate->regs.xsave.header.xfeatures &=
~XFEATURE_MASK_PASID;
 
+	/*
+	 * Update shadow stack pointer, in case it changed during
clone.
+	 */
+	if (update_fpu_shstk(dst, ssp))
+		return 1;
+
 	trace_x86_fpu_copy_src(src_fpu);
 	trace_x86_fpu_copy_dst(dst_fpu);
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index b650cde3f64d..bf703f53fa49 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -48,6 +48,7 @@
 #include <asm/frame.h>
 #include <asm/unwind.h>
 #include <asm/tdx.h>
+#include <asm/shstk.h>
 
 #include "process.h"
 
@@ -119,6 +120,7 @@ void exit_thread(struct task_struct *tsk)
 
 	free_vm86(t);
 
+	shstk_free(tsk);
 	fpu__drop(fpu);
 }
 
@@ -140,6 +142,7 @@ int copy_thread(struct task_struct *p, const
struct kernel_clone_args *args)
 	struct inactive_task_frame *frame;
 	struct fork_frame *fork_frame;
 	struct pt_regs *childregs;
+	unsigned long shstk_addr = 0;
 	int ret = 0;
 
 	childregs = task_pt_regs(p);
@@ -174,7 +177,13 @@ int copy_thread(struct task_struct *p, const
struct kernel_clone_args *args)
 	frame->flags = X86_EFLAGS_FIXED;
 #endif
 
-	fpu_clone(p, clone_flags, args->fn);
+	/* Allocate a new shadow stack for pthread if needed */
+	ret = shstk_alloc_thread_stack(p, clone_flags, args-
quoted
stack_size,
+				       &shstk_addr);
That function will return 0 even if shstk_addr hasn't been written in
it
and you will continue merrily and call

	fpu_clone(..., shstk_addr=0);

why don't you return the shadow stack address or negative on error
instead of adding an I/O parameter which is pretty much always nasty
to
deal with.
On a shadow stack allocation error, we fail the copy_thread(). When
shadow stack is enabled, the app might be able to handle a clone
failure, but would not be able to handle starting a new thread without
getting a new shadow stack.

So in your suggestion I guess we would have two types of failure one
that signifies shadow stack is enabled and the allocation failed, and
another that signifies that shadow stack is not enabled, so zero needs
to be passed into fpu_clone()?

We need the output param in shstk_alloc_thread_stack() because we need
to update the SSP to the new shadow stack. If we want to make the non-
shadow stack case handled differently, I think the extra conditionals
are worse, like:
/* Allocate a new shadow stack for pthread if needed */
ret = shstk_alloc_thread_stack(p, clone_flags, args->stack_size,
				&shstk_addr);
if (ret == -EOPNOTSUPP)
	fpu_clone(p, clone_flags, args->fn, 0);
else if (ret < 0)
	return ret;
else
	fpu_clone(p, clone_flags, args->fn, shstk_addr);

Do you think?

It used to be that shstk_alloc_thread_stack() reached into FPU
internals to do the SSP update itself. Then the ability to do this was
removed. So I came up with an interface for allowing features to modify
XSAVE buffers from outside the FPU code. On further discussion, letting
code outside the FPU have flexible access to the XSAVE buffer could
constrain the FPU code from adding optimizations. So Thomas suggested
to pass the SSP along into FPU code so that the FPU modification could
be all monolithic and flexible.

If the default SSP value logic is too hidden, what about some clearer
code and comments, like this?
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index bf703f53fa49..bd123527fcca 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -142,7 +142,7 @@ int copy_thread(struct task_struct *p, const struct
kernel_clone_args *args)
        struct inactive_task_frame *frame;
        struct fork_frame *fork_frame;
        struct pt_regs *childregs;
-       unsigned long shstk_addr = 0;
+       unsigned long new_ssp;
        int ret = 0;
 
        childregs = task_pt_regs(p);
@@ -177,13 +177,18 @@ int copy_thread(struct task_struct *p, const
struct kernel_clone_args *args)
        frame->flags = X86_EFLAGS_FIXED;
 #endif
 
-       /* Allocate a new shadow stack for pthread if needed */
+       /*
+        * Allocate a new shadow stack for thread if needed. If shadow
stack,
+        * is disabled, new_ssp will remain 0, and fpu_clone() will
know not to
+        * update it.
+        */
+       new_ssp = 0;
        ret = shstk_alloc_thread_stack(p, clone_flags, args-
stack_size,
-                                      &shstk_addr);
+                                      &new_ssp);
        if (ret)
                return ret;
 
-       fpu_clone(p, clone_flags, args->fn, shstk_addr);
+       fpu_clone(p, clone_flags, args->fn, new_ssp);
 
        /* Kernel thread ? */
        if (unlikely(p->flags & PF_KTHREAD)) {

Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help