Re: [PATCH v6 bpf-next 0/5] bpf_prog_pack followup
From: Luis Chamberlain <mcgrof@kernel.org>
Date: 2022-07-12 04:19:36
Also in:
bpf, linux-mm, lkml
On Sat, Jul 09, 2022 at 01:14:23AM +0000, Song Liu wrote:
quoted
On Jul 8, 2022, at 3:24 PM, Luis Chamberlain [off-list ref] wrote:quoted
1) Rename module_alloc_huge as module_alloc_text_huge();module_alloc_text_huge() is too long, but I've suggested names before which are short and generic, and also suggested that if modules are not the only users this needs to go outside of modules and so vmalloc_text_huge() or whatever. To do this right it begs the question why we don't do that for the existing module_alloc(), as the users of this code is well outside of modules now. Last time a similar generic name was used all the special arch stuff was left to be done by the module code still, but still non-modules were still using that allocator. From my perspective the right thing to do is to deal with all the arch stuff as well in the generic handler, and have the module code *and* the other users which use module_alloc() to use that new caller as well.The key difference between module_alloc() and the new API is that the API will return RO+X memory, and the user need text-poke like API to modify this buffer. Archs that do not support text-poke will not be able to use the new API. Does this sound like a reasonable design?
I'm adding kprobe + ftrace folks. I can't see why we need to *require* text_poke for just a module_alloc_huge(). Enhancements on module_alloc() are just enhancements, not requirements. So we have these for instance: ``` from arch/Kconfig config ARCH_OPTIONAL_KERNEL_RWX def_bool n config ARCH_OPTIONAL_KERNEL_RWX_DEFAULT def_bool n config ARCH_HAS_STRICT_KERNEL_RWX def_bool n config STRICT_KERNEL_RWX bool "Make kernel text and rodata read-only" if ARCH_OPTIONAL_KERNEL_RWX depends on ARCH_HAS_STRICT_KERNEL_RWX default !ARCH_OPTIONAL_KERNEL_RWX || ARCH_OPTIONAL_KERNEL_RWX_DEFAULT help If this is set, kernel text and rodata memory will be made read-only, and non-text memory will be made non-executable. This provides protection against certain security exploits (e.g. executing the heap or modifying text) These features are considered standard security practice these days. You should say Y here in almost all cases. config ARCH_HAS_STRICT_MODULE_RWX def_bool n config STRICT_MODULE_RWX bool "Set loadable kernel module data as NX and text as RO" if ARCH_OPTIONAL_KERNEL_RWX depends on ARCH_HAS_STRICT_MODULE_RWX && MODULES default !ARCH_OPTIONAL_KERNEL_RWX || ARCH_OPTIONAL_KERNEL_RWX_DEFAULT help If this is set, module text and rodata memory will be made read-only, and non-text memory will be made non-executable. This provides protection against certain security exploits (e.g. writing to text)
With module_alloc() we have the above symbols to tell us when we *can*
support strict module rwx. So the way the kernel's modules are allocated
and used is:
for each module section:
module_alloc()
module_enable_ro()
module_enable_nx()
module_enable_x()
The above can be read in the code as:
load_module() -->
layout_and_allocate()
complete_formation()
Then there is the consideration of set_vm_flush_reset_perms() for
freeing. On the module code we use this fore the RO+X stuff (core_layout,
init_layout), but now that is a bit obfuscated due to the placement of
the call. It would seem the other users use it for the same:
* ebpf
* kprobes
* ftrace
I believe you are mentioning requiring text_poke() because the way
eBPF code uses the module_alloc() is different. Correct me if I'm
wrong, but from what I gather is you use the text_poke_copy() as the data
is already RO+X, contrary module_alloc() use cases. You do this since your
bpf_prog_pack_alloc() calls set_memory_ro() and set_memory_x() after
module_alloc() and before you can use this memory. This is a different type
of allocator. And, again please correct me if I'm wrong but now you want to
share *one* 2 MiB huge-page for multiple BPF programs to help with the
impact of TLB misses.
A vmalloc_ro_exec() by definition would imply a text_poke().
Can kprobes, ftrace and modules use it too? It would be nice
so to not have to deal with the loose semantics on the user to
have to use set_vm_flush_reset_perms() on ro+x later, but
I think this can be addressed separately on a case by case basis.
But a vmalloc_ro_exec() with a respective free can remove the
requirement to do set_vm_flush_reset_perms().
Luis