Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()

From: Nicholas Piggin <npiggin@gmail.com>
Date: 2020-12-29 03:10:16
Also in: linux-arm-kernel, lkml, stable

Excerpts from Andy Lutomirski's message of December 29, 2020 10:56 am:

On Mon, Dec 28, 2020 at 4:36 PM Nicholas Piggin [off-list ref] wrote:

quoted

Excerpts from Andy Lutomirski's message of December 29, 2020 7:06 am:

quoted

On Mon, Dec 28, 2020 at 12:32 PM Mathieu Desnoyers
[off-list ref] wrote:

quoted

----- On Dec 28, 2020, at 2:44 PM, Andy Lutomirski luto@kernel.org wrote:

quoted

On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin
[off-list ref] wrote:

quoted

On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote:

quoted

After chatting with rmk about this (but without claiming that any of
this is his opinion), based on the manpage, I think membarrier()
currently doesn't really claim to be synchronizing caches? It just
serializes cores. So arguably if userspace wants to use membarrier()
to synchronize code changes, userspace should first do the code
change, then flush icache as appropriate for the architecture, and
then do the membarrier() to ensure that the old code is unused?

^ exactly, yes.

quoted

For 32-bit arm, rmk pointed out that that would be the cacheflush()
syscall. That might cause you to end up with two IPIs instead of one
in total, but we probably don't care _that_ much about extra IPIs on
32-bit arm?

This was the original thinking, yes. The cacheflush IPI will flush specific
regions of code, and the membarrier IPI issues context synchronizing
instructions.

APIs should be written in terms of the service they provide to
userspace, and in highest level terms as possible, rather than directing
hardware to do some low level operation. Unfortunately we're stuck with
this for now. We could deprecate it and replace it though.

If userspace wants to modify code and ensure that after the system call
returns then no other thread will be executing the previous code, then
there should be an API for that. It could actually combine the two IPIs
into one for architectures that require both too.

I agree.  The membarrier API for SYNC_CORE is pretty nasty.  I would
much prefer a real API for JITs to use.

quoted

Architectures with coherent i/d caches don't need the cacheflush step.

There are different levels of coherency -- VIVT architectures may have
differing requirements compared to PIPT, etc.

In any case, I feel like the approach taken by the documentation is
fundamentally confusing.  Architectures don't all speak the same
language  How about something like:

The SYNC_CORE operation causes all threads in the caller's address
space (including the caller) to execute an architecture-defined
barrier operation.  membarrier() will ensure that this barrier is
executed at a time such that all data writes done by the calling
thread before membarrier() are made visible by the barrier.
Additional architecture-dependent cache management operations may be
required to use this for JIT code.

As said this isn't what SYNC_CORE does, and it's not what powerpc
context synchronizing instructions do either, it will very much
re-order visibility of stores around such an instruction.

Perhaps the docs should be entirely arch-specific.  It may well be
impossible to state what it does in an arch-neutral way.

I think what I wrote above -- after the call returns, none of the
threads in the process will be executing instructions overwritten
previously by the caller (provided their i-caches are made coherent
with the caller's modifications).

quoted

A thread completes store instructions into a store queue, which is
as far as a context synchronizing event goes. Visibility comes at
some indeterminite time later.

As currently implemented, it has the same visibility semantics as
regular membarrier, too.

Ah I actually missed that SYNC_CORE is in _addition_ to the memory
barriers, and that's documented API too, not just implementation
sorry.

So if I do:

a = 1;
membarrier(SYNC_CORE);
b = 1;

and another thread does:

while (READ_ONCE(b) != 1)
  ;
barrier();
assert(a == 1);

Right that's true, due to the MEMBARRIER_CMD_PRIVATE_EXPEDITED. Neither
that barrier or the added SYNC_CORE behaviour imply visibility though.

then the assertion will pass.  Similarly, one can do this, I hope:

memcpy(codeptr, [some new instructions], len);
arch_dependent_cache_flush(codeptr, len);
membarrier(SYNC_CORE);
ready = 1;

and another thread does:

while (READ_ONCE(ready) != 1)
  ;
barrier();
(*codeptr)();

arch_dependent_cache_flush is a nop on x86.  On arm and arm64, it
appears to be a syscall, although maybe arm64 can do it from
userspace.  I still don't know what it is on powerpc.

Even using the term "cache" here is misleading.  x86 chips have all
kinds of barely-documented instruction caches, and they have varying
degrees of coherency.  The architecture actually promises that, if you
do a certain incantation, then you get the desired result.
membarrier() allows a user to do this incantation.  But trying to
replicate the incantation verbatim on an architecture like ARM is
insufficient, and trying to flush the things that are documented as
being caches on x86 is expensive and a complete waste of time on x86.
When you write some JIT code, you do *not* want to flush it all the
way to main memory, especially on CPUs don't have the ability to write
back invalidating.  (That's most CPUs.)

Even on x86, I suspect that the various decoded insn caches are rather
more coherent than documented, and I have questions in to Intel about
this.  No answers yet.

So perhaps the right approach is to say that membarrier() helps you
perform the architecture-specific sequence of steps needed to safely
modify code.  On x86, you use it like this.  On arm64, you do this
other thing.  On powerpc, you do something else.

I think it should certainly be documented in terms of what guarantees
it provides to application, _not_ the kinds of instructions it may or
may not induce the core to execute. And if existing API can't be
re-documented sanely, then deprecatd and new ones added that DTRT.
Possibly under a new system call, if arch's like ARM want a range
flush and we don't want to expand the multiplexing behaviour of
membarrier even more (sigh).

quoted

I would be surprised if x86's serializing instructions were different
than powerpc. rdtsc ordering or flushing stores to cache would be
surprising.

At the very least, x86 has several levels of what ARM might call
"context synchronization" AFAICT.  STAC, CLAC, and POPF do a form of
context synchronization in that the changes they cause to the MMU take
effect immediately, but they are not documented as synchronizing the
instruction stream.  "Serializing" instructions do all kinds of
things, not all of which may be architecturally visible at all.
MFENCE and LFENCE do various complicated things, and LFENCE has magic
retroactive capabilities on old CPUs that were not documented when
those CPUs were released.  SFENCE does a different form of
synchronization entirely.  LOCK does something else.  (The
relationship between LOCK and MFENCE is confusing at best.)  RDTSC
doesn't serialize anything at all, but RDTSCP does provide a form of
serialization that's kind of ilke LFENCE.  It's a mess.  Even the
manuals are inconsistent about what "serialize" means.  JMP has its
own magic on x86, but only on very very old CPUs.

The specific instruction that flushes everything into the coherency
domain is SFENCE, and SFENCE is not, for normal purposes, needed for
self- or cross-modifying code.

Good reason to avoid such language in the system call interface!

Thanks,
Nick

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help