Re: [PATCH v6 0/1] ns: introduce binfmt_misc namespace

From: Jann Horn <jannh@google.com>
Date: 2018-11-01 23:48:06
Also in: linux-fsdevel, lkml

On Thu, Nov 1, 2018 at 3:10 PM James Bottomley
[off-list ref] wrote:

On Thu, 2018-11-01 at 04:51 +0100, Jann Horn wrote:

On Thu, Nov 1, 2018 at 3:59 AM James Bottomley
[off-list ref] wrote:

On Tue, 2018-10-16 at 11:52 +0200, Laurent Vivier wrote:

Hi,

Any comment on this last version?

Any chance to be merged?

I've got a use case for this:  I went to one of the Graphene talks
in Edinburgh and it struck me that we seem to keep reinventing the
type of sandboxing that qemu-user already does.  However if you
want to do an x86 on x86 sandbox, you can't currently use the
binfmt_misc mechanism because that has you running *every* binary
on the system emulated. Doing it per user namespace fixes this
problem and allows us to at least cut down on all the pointless
duplication.

Waaaaaait. What? qemu-user does not do "sandboxing". qemu-user makes
your code slower and *LESS* secure. As far as I know, qemu-user is
only intended for purposes like development and testing.

Sandboxing is about protecting the cloud service provider (and other
tenants) from horizontal attack by reducing calls to the shared kernel.
I think it's pretty indisputable that full emulation is an effective
sandbox in that regard.

We can argue about bugginess vs completeness, but technologically
qemu-user already has most of the system calls, which seems to be a
significant problem with other sandboxes.  I also can't dispute it's
slower, but that's a tradeoff for people to make.

I'm pretty sure you don't understand how qemu-user works.

When the emulated code makes a syscall, QEMU just forwards the syscall
to the native kernel.

QEMU doesn't even prevent you from accessing the address space used by
the emulation logic.

qemu-user is not for sandboxing. qemu-user is not for security.
qemu-user is for running binaries from architecture A on architecture
B, with as much direct access to the kernel's syscall surface as
possible.


An example:

$ cat blah.c
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
int main(void) {
  open("/foo/bar/blah", O_RDONLY);
  char c;
  printf("ptr is %p\n", &c);
  read(1337, &c, 1);
  *(volatile char *)0x13371338;
}
$ aarch64-linux-gnu-gcc -static -o blah blah.c && strace -f qemu-aarch64 ./blah
[...]
[pid 14181] openat(AT_FDCWD, "/foo/bar/blah", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 14181] fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 93), ...}) = 0
[pid 14181] write(1, "ptr is 0x40007fff2f\n", 20ptr is 0x40007fff2f
) = 20
[pid 14181] read(1337, 0x40007fff2f, 1) = -1 EBADF (Bad file descriptor)
[pid 14181] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x13371338} ---
[...]
