[PATCH v5 3/5] x86: Split syscall_trace_enter into two phases

From: Kees Cook <hidden>
Date: 2015-02-05 23:12:44
Also in: linux-arch, linux-mips, lkml

On Thu, Feb 5, 2015 at 1:52 PM, Andy Lutomirski [off-list ref] wrote:

On Thu, Feb 5, 2015 at 1:40 PM, Dmitry V. Levin [off-list ref] wrote:

quoted

On Thu, Feb 05, 2015 at 01:27:16PM -0800, Kees Cook wrote:

quoted

On Thu, Feb 5, 2015 at 1:19 PM, Dmitry V. Levin [off-list ref] wrote:

quoted

Hi,

On Fri, Sep 05, 2014 at 03:13:54PM -0700, Andy Lutomirski wrote:

quoted

This splits syscall_trace_enter into syscall_trace_enter_phase1 and
syscall_trace_enter_phase2.  Only phase 2 has full pt_regs, and only
phase 2 is permitted to modify any of pt_regs except for orig_ax.

This breaks ptrace, see below.

quoted

The intent is that phase 1 can be called from the syscall fast path.

In this implementation, phase1 can handle any combination of
TIF_NOHZ (RCU context tracking), TIF_SECCOMP, and TIF_SYSCALL_AUDIT,
unless seccomp requests a ptrace event, in which case phase2 is
forced.

In principle, this could yield a big speedup for TIF_NOHZ as well as
for TIF_SECCOMP if syscall exit work were similarly split up.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/include/asm/ptrace.h |   5 ++
 arch/x86/kernel/ptrace.c      | 157 +++++++++++++++++++++++++++++++++++-------
 2 files changed, 138 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 6205f0c434db..86fc2bb82287 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h

@@ -75,6 +75,11 @@ convert_ip_to_linear(struct task_struct *child, struct pt_regs *regs);
 extern void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
                       int error_code, int si_code);

+
+extern unsigned long syscall_trace_enter_phase1(struct pt_regs *, u32 arch);
+extern long syscall_trace_enter_phase2(struct pt_regs *, u32 arch,
+                                    unsigned long phase1_result);
+
 extern long syscall_trace_enter(struct pt_regs *);
 extern void syscall_trace_leave(struct pt_regs *);

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index bbf338a04a5d..29576c244699 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c

@@ -1441,20 +1441,126 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
      force_sig_info(SIGTRAP, &info, tsk);
 }

+static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
+{
+#ifdef CONFIG_X86_64
+     if (arch == AUDIT_ARCH_X86_64) {
+             audit_syscall_entry(arch, regs->orig_ax, regs->di,
+                                 regs->si, regs->dx, regs->r10);
+     } else
+#endif
+     {
+             audit_syscall_entry(arch, regs->orig_ax, regs->bx,
+                                 regs->cx, regs->dx, regs->si);
+     }
+}
+
 /*
- * We must return the syscall number to actually look up in the table.
- * This can be -1L to skip running any syscall at all.
+ * We can return 0 to resume the syscall or anything else to go to phase
+ * 2.  If we resume the syscall, we need to put something appropriate in
+ * regs->orig_ax.
+ *
+ * NB: We don't have full pt_regs here, but regs->orig_ax and regs->ax
+ * are fully functional.
+ *
+ * For phase 2's benefit, our return value is:
+ * 0:                        resume the syscall
+ * 1:                        go to phase 2; no seccomp phase 2 needed
+ * anything else:    go to phase 2; pass return value to seccomp
  */
-long syscall_trace_enter(struct pt_regs *regs)
+unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 {
-     long ret = 0;
+     unsigned long ret = 0;
+     u32 work;
+
+     BUG_ON(regs != task_pt_regs(current));
+
+     work = ACCESS_ONCE(current_thread_info()->flags) &
+             _TIF_WORK_SYSCALL_ENTRY;

      /*
       * If TIF_NOHZ is set, we are required to call user_exit() before
       * doing anything that could touch RCU.
       */
-     if (test_thread_flag(TIF_NOHZ))
+     if (work & _TIF_NOHZ) {
              user_exit();
+             work &= ~TIF_NOHZ;
+     }
+
+#ifdef CONFIG_SECCOMP
+     /*
+      * Do seccomp first -- it should minimize exposure of other
+      * code, and keeping seccomp fast is probably more valuable
+      * than the rest of this.
+      */
+     if (work & _TIF_SECCOMP) {
+             struct seccomp_data sd;
+
+             sd.arch = arch;
+             sd.nr = regs->orig_ax;
+             sd.instruction_pointer = regs->ip;
+#ifdef CONFIG_X86_64
+             if (arch == AUDIT_ARCH_X86_64) {
+                     sd.args[0] = regs->di;
+                     sd.args[1] = regs->si;
+                     sd.args[2] = regs->dx;
+                     sd.args[3] = regs->r10;
+                     sd.args[4] = regs->r8;
+                     sd.args[5] = regs->r9;
+             } else
+#endif
+             {
+                     sd.args[0] = regs->bx;
+                     sd.args[1] = regs->cx;
+                     sd.args[2] = regs->dx;
+                     sd.args[3] = regs->si;
+                     sd.args[4] = regs->di;
+                     sd.args[5] = regs->bp;
+             }
+
+             BUILD_BUG_ON(SECCOMP_PHASE1_OK != 0);
+             BUILD_BUG_ON(SECCOMP_PHASE1_SKIP != 1);
+
+             ret = seccomp_phase1(&sd);
+             if (ret == SECCOMP_PHASE1_SKIP) {
+                     regs->orig_ax = -1;

How the tracer is expected to get the correct syscall number after that?

There shouldn't be a tracer if a skip is encountered. (A seccomp skip
would skip ptrace.) This behavior hasn't changed, but maybe I don't
see what you mean? (I haven't encountered any problems with syscall
tracing as a result of these changes.)

SECCOMP_RET_ERRNO leads to SECCOMP_PHASE1_SKIP, and if there is a tracer,
it will get -1 as a syscall number.

I've found this while testing a strace parser for
SECCOMP_MODE_FILTER/SECCOMP_SET_MODE_FILTER, so the problem is quite real.

Hasn't it always been this way?

As far as I know, yes, it's always been this way. The point is to the
skip the syscall, which is what the -1 signals. Userspace then reads
back the errno.

I admit that I kind of wish this worked the other way -- that is, I
think it would be nice to have a mode in which ptrace runs before
seccomp, which would close the ptrace hole (where ptrace can do things
that seccomp would disallow) and maybe have more comprehensible
results.

I prefer keeping the seccomp attack surface as tiny as possible. I
would not like to ptrace happening before seccomp.

-Kees

-- 
Kees Cook
Chrome OS Security

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help