Thread (24 messages) 24 messages, 5 authors, 2020-06-17

Re: [PATCH 4/8] seccomp: Implement constant action bitmaps

From: Kees Cook <hidden>
Date: 2020-06-16 18:49:43
Also in: linux-api, lkml

On Tue, Jun 16, 2020 at 08:36:28PM +0200, Jann Horn wrote:
On Tue, Jun 16, 2020 at 5:49 PM Kees Cook [off-list ref] wrote:
quoted
On Tue, Jun 16, 2020 at 02:14:47PM +0200, Jann Horn wrote:
quoted
Wouldn't it be simpler to use a function that can run a subset of
seccomp cBPF and bails out on anything that indicates that a syscall's
handling is complex or on instructions it doesn't understand? For
syscalls that have a fixed policy, a typical seccomp filter doesn't
even use any of the BPF_ALU ops, the scratch space, or the X register;
it just uses something like the following set of operations, which is
easy to emulate without much code:

BPF_LD | BPF_W | BPF_ABS
BPF_JMP | BPF_JEQ | BPF_K
BPF_JMP | BPF_JGE | BPF_K
BPF_JMP | BPF_JGT | BPF_K
BPF_JMP | BPF_JA
BPF_RET | BPF_K
Initially, I started down this path. It needed a bit of plumbing into
BPF to better control the lifetime of the cBPF "saved original filter"
(normally used by CHECKPOINT_RESTORE uses)
I don't think you need that? When a filter is added, you can compute
the results of the added individual filter, and then merge the state.
That's what I thought too, but unfortunately not (unless I missed
something) -- the seccomp verifier is run as a callback from the BPF
internals, so seccomp only see what the user sends (which is unverified)
and the final eBPF filter. There isn't state I can attach during the
callback, so I opted to just do the same thing as CHECKPOINT_RESTORE,
but to then explicitly free the cBPF after bitmap generation.
quoted
and then I needed to keep
making exceptions (same list you have: ALU, X register, scratch, etc)
in the name of avoiding too much complexity in the emulator. I decided
I'd rather reuse the existing infrastructure to actually execute the
filter (no cBPF copy needed to be saved, no separate code, and full
instruction coverage).
If you really think that this bit of emulation is so bad, you could
also make a copy of the BPF filter in which you replace all load
instructions from syscall arguments with "return NON_CONSTANT_RESULT",
and then run that through the normal BPF infrastructure.
quoted
quoted
Something like (completely untested):
[...]
quoted
I didn't actually finish going down the emulator path (I stopped right
around the time I verified that libseccomp does use BPF_ALU -- though
only BPF_AND), so I didn't actually evaluate the filter contents for other
filter builders (i.e. Chrome).

But, if BPF_ALU | BPF_AND were added to your code above, it would cover
everything libseccomp generates (which covers a lot of the seccomp
filters, e.g. systemd, docker). I just felt funny about an "incomplete"
emulator.

Though now you've got me looking. It seems this is the core
of Chrome's BPF instruction generation:
https://github.com/chromium/chromium/blob/master/sandbox/linux/bpf_dsl/policy_compiler.cc
It also uses ALU|AND, but adds JMP|JSET.

So... that's only 2 more instructions to cover what I think are likely
the two largest seccomp instruction generators.
quoted
That way, you won't need any of this complicated architecture-specific stuff.
There are two arch-specific needs, and using a cBPF-subset emulator
just gets rid of the local TLB flush. The other part is distinguishing
the archs. Neither requirement is onerous (TLB flush usually just
needs little more than an extern, arch is already documented in the
per-arch syscall_get_arch()).
But it's also somewhat layer-breaking and reliant on very specific
assumptions. Normal kernel code doesn't mess around with page table
magic, outside of very specific low-level things. And your method
would break if the fixed-value members were not all packed together at
the start of the structure.
Right -- that was lucky. I suspect the emulation route will win out
here.
And from a hardening perspective: The more code we add that fiddles
around with PTEs directly, rather than going through higher-level
abstractions, the higher the chance that something gets horribly
screwed up. For example, this bit from your patch looks *really*
suspect:

+                       preempt_disable();
+                       set_pte_at(&init_mm, vaddr, ptep,
pte_mkold(*(READ_ONCE(ptep))));
+                       local_flush_tlb_kernel_range(vaddr, vaddr + PAGE_SIZE);
+                       preempt_enable();

First off, that set_pte_at() is just a memory write; I don't see why
you put it inside a preempt_disable() region.
But more importantly, sticking a local TLB flush inside a
preempt_disable() region with nothing else in there looks really
shady. How is that supposed to work? If we migrate from CPU0 to CPU1
directly before this region, and then from CPU1 back to CPU0 directly
afterwards, the local TLB flush will have no effect.
Yeah, true, that's another good reason not to do this.

-- 
Kees Cook
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help