Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call

[RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH for 4.15 02/24] Restartable sequences: ARM 32 architecture support · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH for 4.15 06/24] Restartable sequences: powerpc architecture support · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH for 4.15 03/24] Restartable sequences: wire up ARM 32 system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH for 4.15 23/24] membarrier: selftest: Test private expedited sync core cmd · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
Re: [RFC PATCH for 4.15 23/24] membarrier: selftest: Test private expedited sync core cmd · Shuah Khan <hidden> · 2017-11-17
Re: [RFC PATCH for 4.15 23/24] membarrier: selftest: Test private expedited sync core cmd · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-17
[RFC PATCH v2 for 4.15 21/24] x86: Introduce sync_core_before_usermode · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH for 4.15 18/24] membarrier: provide SHARED_EXPEDITED command · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
Re: [RFC PATCH for 4.15 18/24] membarrier: provide SHARED_EXPEDITED command · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-15
[RFC PATCH for 4.15 19/24] membarrier: selftest: Test shared expedited cmd · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
Re: [RFC PATCH for 4.15 19/24] membarrier: selftest: Test shared expedited cmd · Shuah Khan <hidden> · 2017-11-17
[RFC PATCH for 4.15 20/24] membarrier: Provide core serializing command · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH v7 for 4.15 16/24] membarrier: powerpc: Skip memory barrier in switch_mm() · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH v5 for 4.15 17/24] membarrier: Document scheduler barrier requirements · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH v2 for 4.15 22/24] membarrier: x86: Provide core serializing command · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH v2 for 4.15 13/24] Restartable sequences: Provide self-tests · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH v2 for 4.15 15/24] membarrier: selftest: Test private expedited cmd · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH v2 for 4.15 12/24] cpu_opv: Implement selftests · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH for 4.15 14/24] Restartable sequences selftests: arm: workaround gcc asm size guess · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH for 4.15 24/24] membarrier: arm64: Provide core serializing command · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH for 4.15 09/24] cpu_opv: Wire up x86 32/64 system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH for 4.15 11/24] cpu_opv: Wire up ARM32 system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-15
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Michael Kerrisk (man-pages) <hidden> · 2017-11-15
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-15
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Thomas Gleixner <hidden> · 2017-11-16
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Andi Kleen <hidden> · 2017-11-17
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Thomas Gleixner <hidden> · 2017-11-17
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-17
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Andi Kleen <hidden> · 2017-11-17
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Thomas Gleixner <hidden> · 2017-11-17
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Andi Kleen <hidden> · 2017-11-17
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Thomas Gleixner <hidden> · 2017-11-17
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Andy Lutomirski <luto@amacapital.net> · 2017-11-18
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Thomas Gleixner <hidden> · 2017-11-17
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-20
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-20
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Thomas Gleixner <hidden> · 2017-11-20
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Thomas Gleixner <hidden> · 2017-11-20
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-20
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-20
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Andi Kleen <hidden> · 2017-11-20
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-20
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Thomas Gleixner <hidden> · 2017-11-20
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-21
[RFC PATCH for 4.15 10/24] cpu_opv: Wire up powerpc system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH for 4.15 07/24] Restartable sequences: Wire up powerpc system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH for 4.15 04/24] Restartable sequences: x86 32/64 architecture support · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
Re: [RFC PATCH for 4.15 04/24] Restartable sequences: x86 32/64 architecture support · Thomas Gleixner <hidden> · 2017-11-16
Re: [RFC PATCH for 4.15 04/24] Restartable sequences: x86 32/64 architecture support · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-19
Re: [RFC PATCH for 4.15 04/24] Restartable sequences: x86 32/64 architecture support · Thomas Gleixner <hidden> · 2017-11-20
[RFC PATCH for 4.15 05/24] Restartable sequences: wire up x86 32/64 system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Ben Maurer <hidden> · 2017-11-14
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Ben Maurer <hidden> · 2017-11-14
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Peter Zijlstra <peterz@infradead.org> · 2017-11-16
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-16
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Peter Zijlstra <peterz@infradead.org> · 2017-11-16
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-16
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Peter Zijlstra <peterz@infradead.org> · 2017-11-16
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-16
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Thomas Gleixner <hidden> · 2017-11-16
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-16
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Peter Zijlstra <peterz@infradead.org> · 2017-11-16
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-16
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Peter Zijlstra <peterz@infradead.org> · 2017-11-16
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Thomas Gleixner <hidden> · 2017-11-16
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-19
Re: [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 · Linus Torvalds <torvalds@linux-foundation.org> · 2017-11-14
Re: [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 · Andy Lutomirski <luto@amacapital.net> · 2017-11-14
Re: [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
Re: [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 · Andy Lutomirski <luto@amacapital.net> · 2017-11-15
Re: [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-15
Re: [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 · Paul Turner <hidden> · 2017-11-14
Re: [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2018-03-27

From: Mathieu Desnoyers <hidden>
Date: 2017-11-21 11:24:58
Also in: lkml

----- On Nov 20, 2017, at 2:44 PM, Thomas Gleixner tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org wrote:

On Mon, 20 Nov 2017, Mathieu Desnoyers wrote:

quoted

----- On Nov 20, 2017, at 12:48 PM, Thomas Gleixner tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org wrote:
The use-case for 4k memcpy operation is a per-cpu ring buffer where
the rseq fast-path does the following:

- ring buffer push: in the rseq asm instruction sequence, a memcpy of a
  given structure (limited to 4k in size) into a ring buffer,
  followed by the final commit instruction which increments the current
  position offset by the number of bytes pushed.

- ring buffer pop: in the rseq asm instruction sequence, a memcpy of
  a given structure (up to 4k) from the ring buffer, at "position" offset.
  The final commit instruction decrements the current position offset by
  the number of bytes pop'd.

Having cpu_opv do a 4k memcpy allow it to handle scenarios where
rseq fails to progress.

I'm still confused. Before you talked about the sequence:

   1) Reserve

   2) Commit

and both use rseq, but obviously you cannot do two "atomic" operation in
one section.

So now you talk about push. Is that what you described earlier as commit?

Assumed that it is, I still have no idea why the memcpy needs to happen in
that preempt disabled region.

If you have space reserved, then the copy does not have any dependencies
neither on the cpu it runs on nor on anything else. So the only
interresting operation is the final commit instruction which tells the
consumer that its ready.

So what is the part I am missing here aside of having difficulties to map
the constantly changing names of this stuff?

Let's clear up some confusion: those are two different use-cases. The
ring buffer with reserve+commit is a FIFO ring buffer, and the ring buffer
with memcpy+position update is a LIFO queue.

Let me explain the various use-cases here, so we know what we're talking
about.

rseq and cpu_opv use-cases

1) per-cpu spinlock

A per-cpu spinlock can be implemented as a rseq consisting of a
comparison operation (== 0) on a word, and a word store (1), followed
by an acquire barrier after control dependency. The unlock path can be
performed with a simple store-release of 0 to the word, which does
not require rseq.

The cpu_opv fallback requires a single-word comparison (== 0) and a
single-word store (1).


2) per-cpu statistics counters

A per-cpu statistics counters can be implemented as a rseq consisting
of a final "add" instruction on a word as commit.

The cpu_opv fallback can be implemented as a "ADD" operation.

Besides statistics tracking, these counters can be used to implement
user-space RCU per-cpu grace period tracking for both single and
multi-process user-space RCU.


3) per-cpu LIFO linked-list (unlimited size stack)

A per-cpu LIFO linked-list has a "push" and "pop" operation,
which respectively adds an item to the list, and removes an
item from the list.

The "push" operation can be implemented as a rseq consisting of
a word comparison instruction against head followed by a word store
(commit) to head. Its cpu_opv fallback can be implemented as a
word-compare followed by word-store as well.

The "pop" operation can be implemented as a rseq consisting of
loading head, comparing it against NULL, loading the next pointer
at the right offset within the head item, and the next pointer as
a new head, returning the old head on success.

The cpu_opv fallback for "pop" differs from its rseq algorithm:
considering that cpu_opv requires to know all pointers at system
call entry so it can pin all pages, so cpu_opv cannot simply load
head and then load the head->next address within the preempt-off
critical section. User-space needs to pass the head and head->next
addresses to the kernel, and the kernel needs to check that the
head address is unchanged since it has been loaded by user-space.
However, when accessing head->next in a ABA situation, it's
possible that head is unchanged, but loading head->next can
result in a page fault due to a concurrently freed head object.
This is why the "expect_fault" operation field is introduced: if a
fault is triggered by this access, "-EAGAIN" will be returned by
cpu_opv rather than -EFAULT, thus indicating the the operation
vector should be attempted again. The "pop" operation can thus be
implemented as a word comparison of head against the head loaded
by user-space, followed by a load of the head->next pointer (which
may fault), and a store of that pointer as a new head.


4) per-cpu LIFO ring buffer with pointers to objects (fixed-sized stack)

This structure is useful for passing around allocated objects
by passing pointers through per-cpu fixed-sized stack.

The "push" side can be implemented with a check of the current
offset against the maximum buffer length, followed by a rseq
consisting of a comparison of the previously loaded offset
against the current offset, a word "try store" operation into the
next ring buffer array index (it's OK to abort after a try-store,
since it's not the commit, and its side-effect can be overwritten),
then followed by a word-store to increment the current offset (commit).

The "push" cpu_opv fallback can be done with the comparison, and
two consecutive word stores, all within the preempt-off section.

The "pop" side can be implemented with a check that offset is not
0 (whether the buffer is empty), a load of the "head" pointer before the
offset array index, followed by a rseq consisting of a word
comparison checking that the offset is unchanged since previously
loaded, another check ensuring that the "head" pointer is unchanged,
followed by a store decrementing the current offset.

The cpu_opv "pop" can be implemented with the same algorithm
as the rseq fast-path (compare, compare, store).


5) per-cpu LIFO ring buffer with pointers to objects (fixed-sized stack)
   supporting "peek" from remote CPU

In order to implement work queues with work-stealing between CPUs, it is
useful to ensure the offset "commit" in scenario 4) "push" have a
store-release semantic, thus allowing remote CPU to load the offset
with acquire semantic, and load the top pointer, in order to check if
work-stealing should be performed. The task (work queue item) existence
should be protected by other means, e.g. RCU.

If the peek operation notices that work-stealing should indeed be
performed, a thread can use cpu_opv to move the task between per-cpu
workqueues, by first invoking cpu_opv passing the remote work queue
cpu number as argument to pop the task, and then again as "push" with
the target work queue CPU number.


6) per-cpu LIFO ring buffer with data copy (fixed-sized stack)
   (with and without acquire-release)

This structure is useful for passing around data without requiring
memory allocation by copying the data content into per-cpu fixed-sized
stack.

The "push" operation is performed with an offset comparison against
the buffer size (figuring out if the buffer is full), followed by
a rseq consisting of a comparison of the offset, a try-memcpy attempting
to copy the data content into the buffer (which can be aborted and
overwritten), and a final store incrementing the offset.

The cpu_opv fallback needs to same operations, except that the memcpy
is guaranteed to complete, given that it is performed with preemption
disabled. This requires a memcpy operation supporting length up to 4kB.

The "pop" operation is similar to the "push, except that the offset
is first compared to 0 to ensure the buffer is not empty. The
copy source is the ring buffer, and the destination is an output
buffer.


7) per-cpu FIFO ring buffer (fixed-sized queue)

This structure is useful wherever a FIFO behavior (queue) is needed.
One major use-case is tracer ring buffer.

An implementation of this ring buffer has a "reserve", followed by
serialization of multiple bytes into the buffer, ended by a "commit".
The "reserve" can be implemented as a rseq consisting of a word
comparison followed by a word store. The reserve operation moves the
producer "head". The multi-byte serialization can be performed
non-atomically. Finally, the "commit" update can be performed with
a rseq "add" commit instruction with store-release semantic. The
ring buffer consumer reads the commit value with load-acquire
semantic to know whenever it is safe to read from the ring buffer.

This use-case requires that both "reserve" and "commit" operations
be performed on the same per-cpu ring buffer, even if a migration
happens between those operations. In the typical case, both operations
will happens on the same CPU and use rseq. In the unlikely event of a
migration, the cpu_opv system call will ensure the commit can be
performed on the right CPU by migrating the task to that CPU.

On the consumer side, an alternative to using store-release and
load-acquire on the commit counter would be to use cpu_opv to
ensure the commit counter load is performed on the right CPU. This
effectively allows moving a consumer thread between CPUs to execute
close to the ring buffer cache lines it will read.

Thanks,

Mathieu

Thanks,

	tglx

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help