Re: [PATCH v6 bpf-next 0/5] bpf_prog_pack followup

From: Luis Chamberlain <mcgrof@kernel.org>
Date: 2022-07-12 04:19:36
Also in: bpf, linux-mm, lkml

On Sat, Jul 09, 2022 at 01:14:23AM +0000, Song Liu wrote:

quoted

On Jul 8, 2022, at 3:24 PM, Luis Chamberlain [off-list ref] wrote:

quoted

1) Rename module_alloc_huge as module_alloc_text_huge();

module_alloc_text_huge() is too long, but I've suggested names before
which are short and generic, and also suggested that if modules are
not the only users this needs to go outside of modules and so
vmalloc_text_huge() or whatever.

To do this right it begs the question why we don't do that for the
existing module_alloc(), as the users of this code is well outside of
modules now. Last time a similar generic name was used all the special
arch stuff was left to be done by the module code still, but still
non-modules were still using that allocator. From my perspective the
right thing to do is to deal with all the arch stuff as well in the
generic handler, and have the module code *and* the other users which
use module_alloc() to use that new caller as well.

The key difference between module_alloc() and the new API is that the 
API will return RO+X memory, and the user need text-poke like API to
modify this buffer. Archs that do not support text-poke will not be
able to use the new API. Does this sound like a reasonable design?

I'm adding kprobe + ftrace folks.

I can't see why we need to *require* text_poke for just a
module_alloc_huge(). Enhancements on module_alloc() are just
enhancements, not requirements. So we have these for instance:

``` from arch/Kconfig
config ARCH_OPTIONAL_KERNEL_RWX
	def_bool n

config ARCH_OPTIONAL_KERNEL_RWX_DEFAULT
	def_bool n

config ARCH_HAS_STRICT_KERNEL_RWX
	def_bool n

config STRICT_KERNEL_RWX
	bool "Make kernel text and rodata read-only" if ARCH_OPTIONAL_KERNEL_RWX
	depends on ARCH_HAS_STRICT_KERNEL_RWX
	default !ARCH_OPTIONAL_KERNEL_RWX || ARCH_OPTIONAL_KERNEL_RWX_DEFAULT
	help
	  If this is set, kernel text and rodata memory will be made read-only,
	  and non-text memory will be made non-executable. This provides
	  protection against certain security exploits (e.g. executing the heap
	  or modifying text)

	  These features are considered standard security practice these days.
	  You should say Y here in almost all cases.

config ARCH_HAS_STRICT_MODULE_RWX
	def_bool n

config STRICT_MODULE_RWX
	bool "Set loadable kernel module data as NX and text as RO" if ARCH_OPTIONAL_KERNEL_RWX
	depends on ARCH_HAS_STRICT_MODULE_RWX && MODULES
	default !ARCH_OPTIONAL_KERNEL_RWX || ARCH_OPTIONAL_KERNEL_RWX_DEFAULT
	help
	  If this is set, module text and rodata memory will be made read-only,
	  and non-text memory will be made non-executable. This provides
	  protection against certain security exploits (e.g. writing to text)

With module_alloc() we have the above symbols to tell us when we *can*
support strict module rwx. So the way the kernel's modules are allocated
and used is:

for each module section:
	module_alloc()
module_enable_ro()
module_enable_nx()
module_enable_x()

The above can be read in the code as:

load_module() -->
	layout_and_allocate()
	complete_formation()

Then there is the consideration of set_vm_flush_reset_perms() for
freeing. On the module code we use this fore the RO+X stuff (core_layout,
init_layout), but now that is a bit obfuscated due to the placement of
the call. It would seem the other users use it for the same:

 * ebpf
 * kprobes
 * ftrace

I believe you are mentioning requiring text_poke() because the way
eBPF code uses the module_alloc() is different. Correct me if I'm
wrong, but from what I gather is you use the text_poke_copy() as the data
is already RO+X, contrary module_alloc() use cases. You do this since your
bpf_prog_pack_alloc() calls set_memory_ro() and set_memory_x() after
module_alloc() and before you can use this memory. This is a different type
of allocator. And, again please correct me if I'm wrong but now you want to
share *one* 2 MiB huge-page for multiple BPF programs to help with the
impact of TLB misses.

A vmalloc_ro_exec() by definition would imply a text_poke().

Can kprobes, ftrace and modules use it too? It would be nice
so to not have to deal with the loose semantics on the user to
have to use set_vm_flush_reset_perms() on ro+x later, but
I think this can be addressed separately on a case by case basis.

But a vmalloc_ro_exec() with a respective free can remove the
requirement to do set_vm_flush_reset_perms().

  Luis

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help