Thread (12 messages) 12 messages, 6 authors, 2020-01-10

Re: [PATCH bpf-next] bpf: Make trampolines W^X

From: Andy Lutomirski <luto@kernel.org>
Date: 2020-01-08 08:41:58
Also in: bpf, lkml

On Jan 7, 2020, at 9:01 AM, Edgecombe, Rick P [off-list ref] wrote:

CC Nadav and Jessica.

On Mon, 2020-01-06 at 15:36 -1000, Andy Lutomirski wrote:
quoted
quoted
On Jan 6, 2020, at 12:25 PM, Edgecombe, Rick P [off-list ref]
wrote:

On Sat, 2020-01-04 at 09:49 +0900, Andy Lutomirski wrote:
quoted
quoted
quoted
quoted
On Jan 4, 2020, at 8:47 AM, KP Singh [off-list ref] wrote:
From: KP Singh [off-list ref]

The image for the BPF trampolines is allocated with
bpf_jit_alloc_exe_page which marks this allocated page executable. This
means that the allocated memory is W and X at the same time making it
susceptible to WX based attacks.

Since the allocated memory is shared between two trampolines (the
current and the next), 2 pages must be allocated to adhere to W^X and
the following sequence is obeyed where trampolines are modified:
Can we please do better rather than piling garbage on top of garbage?
quoted
- Mark memory as non executable (set_memory_nx). While module_alloc for
x86 allocates the memory as PAGE_KERNEL and not PAGE_KERNEL_EXEC, not
all implementations of module_alloc do so
How about fixing this instead?
quoted
- Mark the memory as read/write (set_memory_rw)
Probably harmless, but see above about fixing it.
quoted
- Modify the trampoline
Seems reasonable. It’s worth noting that this whole approach is
suboptimal:
the “module” allocator should really be returning a list of pages to be
written (not at the final address!) with the actual executable mapping to
be
materialized later, but that’s a bigger project that you’re welcome to
ignore
for now.  (Concretely, it should produce a vmap address with backing pages
but
with the vmap alias either entirely unmapped or read-only. A subsequent
healer
would, all at once, make the direct map pages RO or not-present and make
the
vmap alias RX.)
quoted
- Mark the memory as read-only (set_memory_ro)
- Mark the memory as executable (set_memory_x)
No, thanks. There’s very little excuse for doing two IPI flushes when one
would suffice.

As far as I know, all architectures can do this with a single flush
without
races  x86 certainly can. The module freeing code gets this sequence
right.
Please reuse its mechanism or, if needed, export the relevant interfaces.
So if I understand this right, some trampolines have been added that are
currently set as RWX at modification time AND left that way during runtime?
The
discussion on the order of set_memory_() calls in the commit message made me
think that this was just a modification time thing at first.
I’m not sure what the status quo is.

We really ought to have a genuinely good API for allocation and initialization
of text.  We can do so much better than set_memory_blahblah.

FWIW, I have some ideas about making kernel flushes cheaper. It’s currently
blocked on finding some time and on tglx’s irqtrace work.
Makes sense to me. I guess there are 6 types of text allocations now:
- These two BPF trampolines
- BPF JITs
- Modules
- Kprobes
- Ftrace

All doing (or should be doing) pretty much the same thing. I believe Jessica had
said at one point that she didn't like all the other features using
module_alloc() as it was supposed to be just for real modules. Where would the
API live?
New header?  This shouldn’t matter that much.

Here are two strawman proposals.  All of this is very rough -- the
actual data structures and signatures are likely problematic for
multiple reasons.
--- First proposal ---
struct text_allocation {
  void *final_addr;
  struct page *pages;
  int npages;
};

int text_alloc(struct text_allocation *out, size_t size);

/* now final_addr is not accessible and pages is writable. */

int text_freeze(struct text_allocation *alloc);

/* now pages are not accessible and final_addr is RO.  Alternatively,
pages are RO and final_addr is unmapped. */

int text_finish(struct text_allocation *alloc);

/* now final_addr is RX.  All done. */

This gets it with just one flush and gives a chance to double-check in
case of race attacks from other CPUs.  Double-checking is annoying,
though.
--- Second proposal ---
struct text_allocation {
  void *final_addr;
  /* lots of opaque stuff including an mm_struct */
  /* optional: list of struct page, but this isn't obviously useful */
};

int text_alloc(struct text_allocation *out, size_t size);

/* Memory is allocated.  There is no way to access it at all right
now.  The memory is RO or not present in the direct map. */

void __user *text_activate_mapping(struct text_allocation *out);

/* Now the text is RW at *user* address given by return value.
Preemption is off if required by use_temporary_mm().  Real user memory
cannot be accessed. */

void text_deactivate_mapping(struct text_allocation *alloc);

/* Now the memory is inaccessible again. */

void text_finalize(struct text_allocation *alloc);

/* Now it's RX or XO at the final address. */


Pros of second approach:

 - Inherently immune to cross-CPU attack.  No double-check.

 - If we ever implement a cache of non-direct-mapped, unaliased pages,
then it works with no flushes at all.  We could even relax it a bit to
allow non-direct-mapped pages that may have RX / XO aliases but no W
aliases.

 - Can easily access without worrying about page boundaries.

Cons:

 - The use of a temporary mm is annoying -- you can't copy from user
memory, for example.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help