Re: [PATCH v7 30/41] x86/shstk: Handle thread shadow stack
From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Date: 2023-03-08 20:03:35
Also in:
linux-arch, linux-doc, linux-mm, lkml
Subsystem:
the rest, x86 architecture (32-bit and 64-bit) · Maintainers:
Linus Torvalds, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen
On Wed, 2023-03-08 at 16:26 +0100, Borislav Petkov wrote:
On Mon, Feb 27, 2023 at 02:29:46PM -0800, Rick Edgecombe wrote:quoted
From: Yu-cheng Yu <redacted> When a process is duplicated, but the child shares the address space with the parent, there is potential for the threads sharing a single stack to cause conflicts for each other. In the normal non-cet case this is handled"non-CET"
Sure.
quoted
in two ways. With regular CLONE_VM a new stack is provided by userspace such that the parent and child have different stacks. For vfork, the parent is suspended until the child exits. So as long as the child doesn't return from the vfork()/CLONE_VFORK calling function and sticks to a limited set of operations, the parent and child can share the same stack. For shadow stack, these scenarios present similar sharing problems. For the CLONE_VM case, the child and the parent must have separate shadow stacks. Instead of changing clone to take a shadow stack, have the kernel just allocate one and switch to it. Use stack_size passed from clone3() syscall for thread shadow stack size. A compat-mode thread shadow stack size is further reduced to 1/4. This allows more threads to run in a 32-bit address space. The clone() does not pass stack_size, which was added to clone3(). In that case, use RLIMIT_STACK size and cap to 4 GB. For shadow stack enabled vfork(), the parent and child can share the same shadow stack, like they can share a normal stack. Since the parent is suspended until the child terminates, the child will not interfere with the parent while executing as long as it doesn't return from the vfork() and overwrite up the shadow stack. The child can safely overwrite down the shadow stack, as the parent can just overwrite this later. So CET does not add any additional limitations for vfork(). Userspace implementing posix vfork() can actually prevent the child from"POSIX"
Ok.
...quoted
diff --git a/arch/x86/kernel/fpu/core.cb/arch/x86/kernel/fpu/core.c index f851558b673f..bc3de4aeb661 100644--- a/arch/x86/kernel/fpu/core.c +++ b/arch/x86/kernel/fpu/core.c@@ -552,8 +552,41 @@ static inline void fpu_inherit_perms(structfpu *dst_fpu) } } +#ifdef CONFIG_X86_USER_SHADOW_STACK +static int update_fpu_shstk(struct task_struct *dst, unsigned long ssp) +{ + struct cet_user_state *xstate; + + /* If ssp update is not needed. */ + if (!ssp) + return 0; + + xstate = get_xsave_addr(&dst->thread.fpu.fpstate->regs.xsave, + XFEATURE_CET_USER); + + /* + * If there is a non-zero ssp, then 'dst' must be configured with a shadow + * stack and the fpu state should be up to date since it was just copied + * from the parent in fpu_clone(). So there must be a valid non-init CET + * state location in the buffer. + */ + if (WARN_ON_ONCE(!xstate)) + return 1; + + xstate->user_ssp = (u64)ssp; + + return 0; +} +#else +static int update_fpu_shstk(struct task_struct *dst, unsigned long shstk_addr)^ ^^^^^^^^^^ ssp, like above. Better yet: static int update_fpu_shstk(struct task_struct *dst, unsigned long ssp) { #ifdef CONFIG_X86_USER_SHADOW_STACK ... #endif return 0; } and less ifdeffery.
Sure. Sometimes people tell me to only ifdef out whole functions to make it easier to read. I suppose in this case it's not hard to see.
quoted
+{ + return 0; +} +#endif + /* Clone current's FPU state on fork */ -int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal) +int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal, + unsigned long ssp) { struct fpu *src_fpu = ¤t->thread.fpu; struct fpu *dst_fpu = &dst->thread.fpu;@@ -613,6 +646,12 @@ int fpu_clone(struct task_struct *dst,unsigned long clone_flags, bool minimal) if (use_xsave()) dst_fpu->fpstate->regs.xsave.header.xfeatures &= ~XFEATURE_MASK_PASID; + /* + * Update shadow stack pointer, in case it changed during clone. + */ + if (update_fpu_shstk(dst, ssp)) + return 1; + trace_x86_fpu_copy_src(src_fpu); trace_x86_fpu_copy_dst(dst_fpu);diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index b650cde3f64d..bf703f53fa49 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c@@ -48,6 +48,7 @@ #include <asm/frame.h> #include <asm/unwind.h> #include <asm/tdx.h> +#include <asm/shstk.h> #include "process.h"@@ -119,6 +120,7 @@ void exit_thread(struct task_struct *tsk) free_vm86(t); + shstk_free(tsk); fpu__drop(fpu); }@@ -140,6 +142,7 @@ int copy_thread(struct task_struct *p, conststruct kernel_clone_args *args) struct inactive_task_frame *frame; struct fork_frame *fork_frame; struct pt_regs *childregs; + unsigned long shstk_addr = 0; int ret = 0; childregs = task_pt_regs(p);@@ -174,7 +177,13 @@ int copy_thread(struct task_struct *p, conststruct kernel_clone_args *args) frame->flags = X86_EFLAGS_FIXED; #endif - fpu_clone(p, clone_flags, args->fn); + /* Allocate a new shadow stack for pthread if needed */ + ret = shstk_alloc_thread_stack(p, clone_flags, args-quoted
stack_size,+ &shstk_addr);That function will return 0 even if shstk_addr hasn't been written in it and you will continue merrily and call fpu_clone(..., shstk_addr=0); why don't you return the shadow stack address or negative on error instead of adding an I/O parameter which is pretty much always nasty to deal with.
On a shadow stack allocation error, we fail the copy_thread(). When shadow stack is enabled, the app might be able to handle a clone failure, but would not be able to handle starting a new thread without getting a new shadow stack. So in your suggestion I guess we would have two types of failure one that signifies shadow stack is enabled and the allocation failed, and another that signifies that shadow stack is not enabled, so zero needs to be passed into fpu_clone()? We need the output param in shstk_alloc_thread_stack() because we need to update the SSP to the new shadow stack. If we want to make the non- shadow stack case handled differently, I think the extra conditionals are worse, like: /* Allocate a new shadow stack for pthread if needed */ ret = shstk_alloc_thread_stack(p, clone_flags, args->stack_size, &shstk_addr); if (ret == -EOPNOTSUPP) fpu_clone(p, clone_flags, args->fn, 0); else if (ret < 0) return ret; else fpu_clone(p, clone_flags, args->fn, shstk_addr); Do you think? It used to be that shstk_alloc_thread_stack() reached into FPU internals to do the SSP update itself. Then the ability to do this was removed. So I came up with an interface for allowing features to modify XSAVE buffers from outside the FPU code. On further discussion, letting code outside the FPU have flexible access to the XSAVE buffer could constrain the FPU code from adding optimizations. So Thomas suggested to pass the SSP along into FPU code so that the FPU modification could be all monolithic and flexible. If the default SSP value logic is too hidden, what about some clearer code and comments, like this?
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index bf703f53fa49..bd123527fcca 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c@@ -142,7 +142,7 @@ int copy_thread(struct task_struct *p, const structkernel_clone_args *args)
struct inactive_task_frame *frame;
struct fork_frame *fork_frame;
struct pt_regs *childregs;
- unsigned long shstk_addr = 0;
+ unsigned long new_ssp;
int ret = 0;
childregs = task_pt_regs(p);@@ -177,13 +177,18 @@ int copy_thread(struct task_struct *p, conststruct kernel_clone_args *args)
frame->flags = X86_EFLAGS_FIXED;
#endif
- /* Allocate a new shadow stack for pthread if needed */
+ /*
+ * Allocate a new shadow stack for thread if needed. If shadow
stack,
+ * is disabled, new_ssp will remain 0, and fpu_clone() will
know not to
+ * update it.
+ */
+ new_ssp = 0;
ret = shstk_alloc_thread_stack(p, clone_flags, args-stack_size,
- &shstk_addr);
+ &new_ssp);
if (ret)
return ret;
- fpu_clone(p, clone_flags, args->fn, shstk_addr);
+ fpu_clone(p, clone_flags, args->fn, new_ssp);
/* Kernel thread ? */
if (unlikely(p->flags & PF_KTHREAD)) {