[PATCH v5 3/5] x86: Split syscall_trace_enter into two phases
From: luto@amacapital.net (Andy Lutomirski)
Date: 2015-02-06 02:39:15
Also in:
linux-arch, linux-mips, lkml
On Thu, Feb 5, 2015 at 6:32 PM, Dmitry V. Levin [off-list ref] wrote:
On Thu, Feb 05, 2015 at 04:09:06PM -0800, Andy Lutomirski wrote:quoted
On Thu, Feb 5, 2015 at 3:49 PM, Kees Cook [off-list ref] wrote:quoted
On Thu, Feb 5, 2015 at 3:39 PM, Dmitry V. Levin [off-list ref] wrote:[...]quoted
quoted
quoted
There is a clear difference: before these changes, SECCOMP_RET_ERRNO used to keep the syscall number unchanged and suppress syscall-exit-stop event, which was awful because userspace cannot distinguish syscall-enter-stop from syscall-exit-stop and therefore relies on the kernel that syscall-enter-stop is followed by syscall-exit-stop (or tracee's death, etc.). After these changes, SECCOMP_RET_ERRNO no longer causes syscall-exit-stop events to be suppressed, but now the syscall number is lost.Ah-ha! Okay, thanks, I understand now. I think this means seccomp phase1 should not treat RET_ERRNO as a "skip" event. Andy, what do you think here?I still don't quite see how this change caused this.I have a test for this at http://sourceforge.net/p/strace/code/ci/HEAD/~/tree/test/seccomp.cquoted
I can play with it a bit more. But RET_ERRNO *has* to be some kind of skip event, because it needs to skip the syscall. We could change this by treating RET_ERRNO as an instruction to enter phase 2 and then asking for a skip in phase 2 without changing orig_ax, but IMO this is pretty ugly. I think this all kind of sucks. We're trying to run ptrace after seccomp, so ptrace is seeing the syscalls as transformed by seccomp. That means that if we use RET_TRAP, then ptrace will see the possibly-modified syscall, if we use RET_ERRNO, then ptrace is (IMO correctly given the current design) showing syscall -1, and if we use RET_KILL, then ptrace just sees the process mysteriously die.Userspace is usually not prepared to see syscall -1. For example, strace had to be patched, otherwise it just skipped such syscalls as "not a syscall" events or did other improper things: http://sourceforge.net/p/strace/code/ci/c3948327717c29b10b5e00a436dc138b4ab1a486 http://sourceforge.net/p/strace/code/ci/8e398b6c4020fb2d33a5b3e40271ebf63199b891
The x32 thing is a legit ABI bug, I'd argue. I'd be happy to submit a patch to fix that (clear the x32 bit if we're not x32).
A slightly different but related story: userspace is also not prepared to handle large errno values produced by seccomp filters like this: BPF_STMT(BPF_RET, SECCOMP_RET_ERRNO | SECCOMP_RET_DATA) For example, glibc assumes that syscalls do not return errno values greater than 0xfff: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86_64/sysdep.h#l55 https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86_64/syscall.S#l20 If it isn't too late, I'd recommend changing SECCOMP_RET_DATA mask applied in SECCOMP_RET_ERRNO case from current 0xffff to 0xfff.
I think this is solidly in the "don't do that" category. Seccomp lets you tamper with syscalls. If you tamper wrong, then you lose. Kees, what do you think about reversing the whole thing so that ptracers always see the original syscall? --Andy