Thread (9 messages) 9 messages, 2 authors, 2022-07-13

Re: [PATCH v6 bpf-next 0/5] bpf_prog_pack followup

From: Song Liu <hidden>
Date: 2022-07-12 23:12:31
Also in: bpf, linux-mm, lkml

On Jul 12, 2022, at 12:04 PM, Luis Chamberlain [off-list ref] wrote:

On Tue, Jul 12, 2022 at 05:49:32AM +0000, Song Liu wrote:
quoted
quoted
On Jul 11, 2022, at 9:18 PM, Luis Chamberlain [off-list ref] wrote:
quoted
I believe you are mentioning requiring text_poke() because the way
eBPF code uses the module_alloc() is different. Correct me if I'm
wrong, but from what I gather is you use the text_poke_copy() as the data
is already RO+X, contrary module_alloc() use cases. You do this since your
bpf_prog_pack_alloc() calls set_memory_ro() and set_memory_x() after
module_alloc() and before you can use this memory. This is a different type
of allocator. And, again please correct me if I'm wrong but now you want to
share *one* 2 MiB huge-page for multiple BPF programs to help with the
impact of TLB misses.
Yes, sharing 1x 2MiB huge page is the main reason to require text_poke. 
OTOH, 2MiB huge pages without sharing is not really useful. Both kprobe
and ftrace only uses a fraction of a 4kB page. Most BPF programs and 
modules cannot use 2MiB either. Therefore, vmalloc_rw_exec() doesn't add
much value on top of current module_alloc(). 
Thanks for the clarification.
quoted
quoted
A vmalloc_ro_exec() by definition would imply a text_poke().

Can kprobes, ftrace and modules use it too? It would be nice
so to not have to deal with the loose semantics on the user to
have to use set_vm_flush_reset_perms() on ro+x later, but
I think this can be addressed separately on a case by case basis.
I am pretty confident that kprobe and ftrace can share huge pages with 
BPF programs.
Then wonderful, we know where to go in terms of a new API then as it
can be shared in the future for sure and there are gains.
quoted
I haven't looked into all the details with modules, but 
given CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC, I think it is also 
possible.
Sure.
quoted
Once this is done, a regular system (without huge BPF program or huge
modules) will just use 1x 2MB page for text from module, ftrace, kprobe, 
and bpf programs. 
That would be nice, if possible, however modules will require likely its
own thing, on my system I see about 57 MiB used on coresize alone.

lsmod | grep -v Module | cut -f1 -d ' ' | \
	xargs sudo modinfo | grep filename | \
	grep -o '/.*' | xargs stat -c "%s - %n" | \
	awk 'BEGIN {sum=0} {sum+=$1} END {print sum}'
60001272

And so perhaps we need such a pool size to be configurable.
quoted
quoted
But a vmalloc_ro_exec() with a respective free can remove the
requirement to do set_vm_flush_reset_perms().
Removing the requirement to set_vm_flush_reset_perms() is the other
reason to go directly to vmalloc_ro_exec(). 
Yes fantastic.
quoted
My current version looks like this:

void *vmalloc_exec(unsigned long size);
void vfree_exec(void *ptr, unsigned int size);

ro is eliminated as there is no rw version of the API. 
Alright.

I am not sure if 2 MiB will suffice given what I mentioned above, and
what to do to ensure this grows at a reasonable pace. Then, at least for
usage for all architectures since not all will support text_poke() we
will want to consider a way to make it easy to users to use non huge
page fallbacks, but that would be up to those users, so we can wait for
that.
We are not limited to 2MiB total. The logic is like: 

1. Anything bigger than 2MiB gets its own allocation.
2. We maintain a list of 2MiB pages, and bitmaps showing which parts of 
   these pages are in use. 
3. For objects smaller than 2MiB, we will try to fit it in one of these
   pages. 
   3. a) If there isn't a page with big enough continuous free space, we
        will allocate a new 2MiB page. 

(For system with n NUMA nodes, multiple 2MiB above by n). 

So, if we have 100 kernel modules using 1MiB each, they will share 50x
2MiB pages. 

Thanks,
Song
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help