Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

From: Song Liu <song@kernel.org>
Date: 2023-06-26 06:13:38
Also in: bpf, linux-arm-kernel, linux-mips, linux-mm, linux-riscv, linux-s390, linux-trace-kernel, linuxppc-dev, lkml, loongarch, netdev, sparclinux

On Sun, Jun 25, 2023 at 11:07 AM Kent Overstreet
[off-list ref] wrote:

On Sun, Jun 25, 2023 at 08:42:57PM +0300, Mike Rapoport wrote:

On Sun, Jun 25, 2023 at 09:59:34AM -0700, Andy Lutomirski wrote:

On Sun, Jun 25, 2023, at 9:14 AM, Mike Rapoport wrote:

On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:

On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:

On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:

On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:

From: "Mike Rapoport (IBM)" <rppt@kernel.org>

module_alloc() is used everywhere as a means to allocate memory for code.

Besides being semantically wrong, this unnecessarily ties all subsystems
that need to allocate code, such as ftrace, kprobes and BPF to modules
and puts the burden of code allocation to the modules code.

Several architectures override module_alloc() because of various
constraints where the executable memory can be located and this causes
additional obstacles for improvements of code allocation.

Start splitting code allocation from modules by introducing
execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.

Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
module_alloc() and execmem_free() and jit_free() are replacements of
module_memfree() to allow updating all call sites to use the new APIs.

The intended semantics for new allocation APIs:

* execmem_text_alloc() should be used to allocate memory that must reside
  close to the kernel image, like loadable kernel modules and generated
  code that is restricted by relative addressing.

* jit_text_alloc() should be used to allocate memory for generated code
  when there are no restrictions for the code placement. For
  architectures that require that any code is within certain distance
  from the kernel image, jit_text_alloc() will be essentially aliased to
  execmem_text_alloc().
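
A minimal userspace sketch of the "initially just wrappers" stage described
above. The function signatures are assumptions (the commit message only names
the functions), and module_alloc()/module_memfree() are replaced by malloc()
and free() stand-ins so the example compiles outside the kernel:

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace stand-ins for the kernel's module_alloc()/module_memfree(). */
static void *module_alloc(size_t size) { return malloc(size); }
static void module_memfree(void *ptr) { free(ptr); }

/* Assumed signatures: at this stage both allocators simply forward. */
void *execmem_text_alloc(size_t size) { return module_alloc(size); }
void execmem_free(void *ptr) { module_memfree(ptr); }
void *jit_text_alloc(size_t size) { return module_alloc(size); }
void jit_free(void *ptr) { module_memfree(ptr); }

/* Usage: callers switch to the new names without any change in behaviour. */
int execmem_demo(void)
{
    void *text = execmem_text_alloc(128);  /* must sit near the kernel image */
    void *jit = jit_text_alloc(128);       /* no placement restriction */
    int ok = text != NULL && jit != NULL;

    execmem_free(text);
    jit_free(jit);
    return ok;
}
```

The point of the split is only that call sites state their placement
requirement; where the memory actually comes from can then change later
without touching them.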

Is there anything in this series to help users do the appropriate
synchronization when they actually populate the allocated memory with
code?  See here, for example:

This series only factors out the executable allocations from modules and
puts them in a central place.
Anything else would go on top after this lands.

Hmm.

On the one hand, there's nothing wrong with factoring out common code. On
the other hand, this is probably the right time to at least start
thinking about synchronization, at least to the extent that it might make
us want to change this API.  (I'm not at all saying that this series
should require changes -- I'm just saying that this is a good time to
think about how this should work.)

The current APIs, *and* the proposed jit_text_alloc() API, don't actually
look like the one thing in the Linux ecosystem that actually
intelligently and efficiently maps new text into an address space:
mmap().

On x86, you can mmap() an existing file full of executable code PROT_EXEC
and jump to it with minimal synchronization (just the standard implicit
ordering in the kernel that populates the pages before setting up the
PTEs and whatever user synchronization is needed to avoid jumping into
the mapping before mmap() finishes).  It works across CPUs, and the only
possible way userspace can screw it up (for a read-only mapping of
read-only text, anyway) is to jump to the mapping too early, in which
case userspace gets a page fault.  Incoherence is impossible, and no one
needs to "serialize" (in the SDM sense).

I think the same sequence (from userspace's perspective) works on other
architectures, too, although I think more cache management is needed on
the kernel's end.  As far as I know, no Linux SMP architecture needs an
IPI to map executable text into usermode, but I could easily be wrong.
(IIRC RISC-V has very developer-unfriendly icache management, but I don't
remember the details.)

(Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is
rather fraught, and I bet many things do it wrong when userspace is
multithreaded.  But not in production because it's mostly not used in
production.)

But jit_text_alloc() can't do this, because the order of operations
doesn't match.  With jit_text_alloc(), the executable mapping shows up
before the text is populated, so there is no atomic change from not-there
to populated-and-executable.  Which means that there is an opportunity
for CPUs, speculatively or otherwise, to start filling various caches
with intermediate states of the text, which means that various
architectures (even x86!) may need serialization.

For eBPF- and module- like use cases, where JITting/code gen is quite
coarse-grained, perhaps something vaguely like:

jit_text_alloc() -> returns a handle and an executable virtual address,
but does *not* map it there
jit_text_write() -> write to that handle
jit_text_map() -> map it and synchronize if needed (no sync needed on
x86, I think)

could be more efficient and/or safer.
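
The three-step lifecycle above can be modelled in plain userspace C. This is
only a sketch of the ordering, not kernel code: the struct layout, the
model_ prefixed names and all signatures are hypothetical, and ordinary heap
buffers stand in for the staging area and for the executable address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical handle: the final address is known, but nothing lives there yet. */
struct jit_handle {
    unsigned char *staging;  /* private buffer the code generator writes into */
    unsigned char *target;   /* stands in for the executable virtual address */
    size_t size;
    bool mapped;             /* true once the text has been published */
};

struct jit_handle *model_jit_text_alloc(size_t size)
{
    struct jit_handle *h = calloc(1, sizeof(*h));

    if (!h)
        return NULL;
    h->staging = calloc(1, size);
    h->target = calloc(1, size);
    if (!h->staging || !h->target) {
        free(h->staging);
        free(h->target);
        free(h);
        return NULL;
    }
    h->size = size;
    return h;
}

/* Writes only touch the staging copy, and only before publication. */
int model_jit_text_write(struct jit_handle *h, size_t off,
                         const void *src, size_t len)
{
    if (h->mapped || off > h->size || len > h->size - off)
        return -1;
    memcpy(h->staging + off, src, len);
    return 0;
}

/*
 * Publish the finished text in one step. A real implementation would
 * install the mapping here and synchronize where the architecture needs it.
 */
const unsigned char *model_jit_text_map(struct jit_handle *h)
{
    if (h->mapped)
        return NULL;
    memcpy(h->target, h->staging, h->size);
    h->mapped = true;
    return h->target;
}

void model_jit_text_free(struct jit_handle *h)
{
    free(h->staging);
    free(h->target);
    free(h);
}
```

In this model no reader can see an intermediate state through the target
address, because nothing is written there until the text is complete, and
writes after publication are rejected.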

(Modules could use this too.  Getting alternatives right might take some
fiddling, because off the top of my head, this doesn't match how it works
now.)

To make alternatives easier, this could work, maybe (haven't fully
thought it through):

jit_text_alloc()
jit_text_map_rw_inplace() -> map at the target address, but RW, !X

write the text and apply alternatives

jit_text_finalize() -> change from RW to RX *and synchronize*

jit_text_finalize() would either need to wait for RCU (possibly extra
heavy weight RCU to get "serialization") or send an IPI.
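
A rough userspace analogue of this second sequence, assuming Linux and a
GCC- or Clang-compatible compiler. The demo_ names are made up for
illustration. mmap() and mprotect() stand in for the kernel-side mapping
changes, and __builtin___clear_cache() only marks where synchronization
belongs; the RCU wait or IPI discussed above is not modelled.

```c
#define _GNU_SOURCE          /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Analogue of "map at the target address, but RW, !X". */
void *demo_map_rw_inplace(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}

/* Analogue of "change from RW to RX *and synchronize*". */
int demo_finalize(void *p, size_t size)
{
    if (mprotect(p, size, PROT_READ | PROT_EXEC) != 0)
        return -1;
    /* Make instruction fetch coherent with the bytes written earlier. */
    __builtin___clear_cache((char *)p, (char *)p + size);
    return 0;
}

/* Allocate, write in place at the final address, then flip to read+execute. */
void *demo_load_text(const void *text, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t size = (len + page - 1) / page * page;  /* round up to whole pages */
    void *p = demo_map_rw_inplace(size);

    if (!p)
        return NULL;
    memcpy(p, text, len);  /* "write the text and apply alternatives" */
    if (demo_finalize(p, size) != 0) {
        munmap(p, size);
        return NULL;
    }
    return p;
}
```

After demo_finalize() the region is readable and executable but no longer
writable, so any later patching would need a separate mechanism.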

This is essentially how modules work now. The memory is allocated RW, written
and updated with alternatives and then made ROX in the end with set_memory
APIs.

The issue with not having the memory mapped X when it's written is that we
cannot use large pages to map it. One of the goals is to have executable
memory mapped with large pages and make code allocator able to divide that
page among several callers.

So the idea was that jit_text_alloc() will have a cache of large pages
mapped ROX, will allocate memory from those caches and there will be
jit_update() that uses text poking for writing to that memory.

Upon allocation of a large page to increase the cache, that large page will
be "invalidated" by filling it with breakpoint instructions (e.g. int3 on
x86).
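
The cache idea in the two paragraphs above can be sketched as a userspace
model. Everything here is hypothetical: the struct, the model_ names, the
bump-allocation policy and the 16-byte alignment are illustration choices,
and a heap buffer stands in for a large page that the kernel would map ROX.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_LARGE_PAGE_SIZE (2u * 1024 * 1024)  /* 2 MiB large page on x86-64 */
#define MODEL_INT3 0xcc                            /* x86 breakpoint opcode */

/* Hypothetical cache: one "large page" divided among several callers. */
struct text_cache {
    unsigned char *base;
    size_t used;
};

int model_cache_init(struct text_cache *c)
{
    c->base = malloc(MODEL_LARGE_PAGE_SIZE);
    if (!c->base)
        return -1;
    /* "Invalidate" the fresh page: a stray jump into it hits a breakpoint. */
    memset(c->base, MODEL_INT3, MODEL_LARGE_PAGE_SIZE);
    c->used = 0;
    return 0;
}

/* Bump-allocate a 16-byte-aligned chunk; NULL once the page is exhausted. */
unsigned char *model_cache_alloc(struct text_cache *c, size_t size)
{
    size_t aligned = (size + 15) & ~(size_t)15;
    unsigned char *p;

    if (aligned < size || aligned > MODEL_LARGE_PAGE_SIZE - c->used)
        return NULL;
    p = c->base + c->used;
    c->used += aligned;
    return p;
}

/*
 * Stand-in for jit_update(): the real mapping would be ROX, so the kernel
 * would have to write through text poking rather than a plain memcpy().
 */
void model_jit_update(unsigned char *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
}
```

Each caller receives a slice of the same page, and every byte that has not
been written yet still reads as the breakpoint filler.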

Is this actually valid?  In between int3 and real code, there’s a
potential torn read of real code mixed up with 0xcc.
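
One way to picture the concern, as a deterministic userspace model rather
than a claim about what any CPU actually does: a five-byte slot starts out
as int3 filler, an instruction is copied in byte by byte, and a fetch of the
whole slot happens partway through. The function name and the fixed
five-byte length are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define INSN_LEN 5

/*
 * Copy the first `written` bytes of `insn` over an int3-filled slot and
 * report what a fetch of the whole slot would see at that moment.
 */
void model_torn_fetch(const unsigned char insn[INSN_LEN], size_t written,
                      unsigned char seen[INSN_LEN])
{
    memset(seen, 0xcc, INSN_LEN);  /* slot starts out as int3 filler */
    if (written > INSN_LEN)
        written = INSN_LEN;
    memcpy(seen, insn, written);   /* partially completed update */
}
```

With insn = b8 2a 00 00 00 (mov eax, 42 on x86) and two bytes written, the
slot holds b8 2a cc cc cc, which decodes as mov eax, 0xcccccc2a: a valid
instruction that is neither the breakpoint nor the intended code.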

You mean while doing text poking?

I think we've been getting distracted by text_poke(). text_poke() does
updates via a different virtual address, which introduces new
synchronization wrinkles, but it's not the main issue.

As I _think_ I understand it, the root of the issue is speculative
execution - and, per Andy, speculative execution doesn't obey memory
barriers.

I have _not_ dug into the details of how retpolines work and all the
spectre stuff that was going on, but - retpoline uses lfence, doesn't
it? And if speculative execution is the issue here, isn't retpoline what
we need?

For this particular issue, I'm not sure "invalidate by filling with
illegal instructions" makes sense. For that to work, would the processor
have to execute a serialize operation and a retry on hitting an illegal
instruction - or perhaps we do in the interrupt handler?

But if filling with illegal instructions does act as a speculation
barrier, then the issue is that a torn read could generate a legal but
incorrect instruction.

What is a "torn read" here? I assume it is an instruction read that
starts at the wrong instruction boundary (CISC). If this is correct, do
we need to handle torn reads caused by a software bug, a hardware
bit flip, or both?

Thanks,
Song
