Thread (99 messages) 99 messages, 10 authors, 2019-04-04

Re: [PATCH v3 2/5] ring: add a non-blocking implementation

From: Eads, Gage <hidden>
Date: 2019-01-25 17:21:30

-----Original Message-----
From: Ola Liljedahl [mailto:Ola.Liljedahl@arm.com]
Sent: Wednesday, January 23, 2019 4:16 AM
To: Eads, Gage <redacted>; dev@dpdk.org
Cc: olivier.matz@6wind.com; stephen@networkplumber.org; nd
[off-list ref]; Richardson, Bruce [off-list ref];
arybchenko@solarflare.com; Ananyev, Konstantin
[off-list ref]
Subject: Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking implementation

On Tue, 2019-01-22 at 21:31 +0000, Eads, Gage wrote:
quoted
Hi Ola,

<snip>
quoted
quoted
@@ -331,6 +433,319 @@ void rte_ring_dump(FILE *f, const struct
rte_ring *r);
 #endif
 #include "rte_ring_generic_64.h"

+/* @internal 128-bit structure used by the non-blocking ring */
+struct nb_ring_entry {
+	void *ptr; /**< Data pointer */
+	uint64_t cnt; /**< Modification counter */
Why not make 'cnt' uintptr_t? This way 32-bit architectures will
also be supported. I think there are some claims that DPDK still supports e.g.
ARMv7a
and possibly also 32-bit x86?
I chose a 64-bit modification counter because (practically speaking)
the ABA problem will not occur with such a large counter -- definitely
not within my lifetime. See the "Discussion" section of the commit
message for more information.

With a 32-bit counter, there is a very (very) low likelihood of it,
but it is possible. Personally, I don't feel comfortable providing
such code, because a) I doubt all users would understand the
implementation well enough to do the risk/reward analysis, and b) such
a bug would be near impossible to reproduce and root-cause if it did occur.
With a 64-bit counter (and 32-bit pointer), 32-bit architectures (e.g. ARMv7a and
probably x86 as well) won't be able to support this as they at best support 64-bit
CAS (ARMv7a has LDREXD/STREXD). So you are essentially putting a 64-bit (and
128-bit CAS) requirement on the implementation.
Yes, I am. I tried to make that clear in the cover letter.
quoted
quoted
quoted
+};
+
+/* The non-blocking ring algorithm is based on the original rte
+ring (derived
+ * from FreeBSD's bufring.h) and inspired by Michael and Scott's
+non-blocking
+ * concurrent queue.
+ */
+
+/**
+ * @internal
+ *   Enqueue several objects on the non-blocking ring
+(single-producer only)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to
+the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible
+to the ring
+ * @param free_space
+ *   returns the amount of space after the enqueue operation has
+finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_nb_enqueue_sp(struct rte_ring *r, void * const *obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *free_space)
+{
+	uint32_t free_entries;
+	size_t head, next;
+
+	n = __rte_ring_move_prod_head_64(r, 1, n, behavior,
+					 &head, &next, &free_entries);
+	if (n == 0)
+		goto end;
+
+	ENQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);
+
+	r->prod_64.tail += n;
Don't we need release order when (or smp_wmb between) writing of the
ring pointers and the update of tail? By updating the tail pointer,
we are synchronising with a consumer.

I prefer using __atomic operations even for load and store. You can
see which parts of the code that synchronise with each other, e.g.
store-release to some location synchronises with load-acquire from
the same location. If you don't know how different threads
synchronise with each other, you are very likely to make mistakes.
You can tell this code was written when I thought x86-64 was the only
viable target :). Yes, you are correct.

With regards to using __atomic intrinsics, I'm planning on taking a
similar approach to the functions duplicated in rte_ring_generic.h and
rte_ring_c11_mem.h: one version that uses rte_atomic functions (and
thus stricter memory ordering) and one that uses __atomic intrinsics
(and thus can benefit from more relaxed memory ordering).
What's the advantage of having two different implementations? What is the
disadvantage?

The existing ring buffer code originally had only the "legacy" implementation
which was kept when the __atomic implementation was added. The reason
claimed was that some older compilers for x86 do not support GCC __atomic
builtins. But I thought there was consensus that new functionality could have
only __atomic implementations.
When CONFIG_RTE_RING_USE_C11_MEM_MODEL was introduced, it was left disabled for thunderx[1] for performance reasons. Assuming that hasn't changed, the advantage to having two versions is to best support all of DPDK's platforms. The disadvantage is of course duplicated code and the additional maintenance burden.

That said, if the thunderx maintainers are ok with it, I'm certainly open to only doing the __atomic version. Note that even in the __atomic version, based on Honnapa's findings[2], using a DPDK-defined rte_atomic128_cmpset() (with additional arguments to support machines with weak consistency) appears to be a better option than __atomic_compare_exchange_16.

I couldn't find the discussion about new functionality using __atomic going forward -- can you send a link?

[1] https://mails.dpdk.org/archives/dev/2017-December/082853.html
[2] http://mails.dpdk.org/archives/dev/2019-January/124002.html
Does the non-blocking ring buffer implementation have to support these older
compilers? Will the applications that require these older compiler be updated to
utilise the non-blocking ring buffer?
(See above -- compiler versions wasn't a consideration here.)
quoted
quoted
quoted
+
+end:
+	if (free_space != NULL)
+		*free_space = free_entries - n;
+	return n;
+}
+
+/**
+ * @internal
+ *   Enqueue several objects on the non-blocking ring
+(multi-producer
+safe)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to
+the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible
+to the ring
+ * @param free_space
+ *   returns the amount of space after the enqueue operation has
+finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_nb_enqueue_mp(struct rte_ring *r, void * const
*obj_table,
quoted
quoted
quoted
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *free_space)
+{
+#if !defined(RTE_ARCH_X86_64) || !defined(ALLOW_EXPERIMENTAL_API)
+	RTE_SET_USED(r);
+	RTE_SET_USED(obj_table);
+	RTE_SET_USED(n);
+	RTE_SET_USED(behavior);
+	RTE_SET_USED(free_space);
+#ifndef ALLOW_EXPERIMENTAL_API
+	printf("[%s()] RING_F_NB requires an experimental API."
+	       " Recompile with ALLOW_EXPERIMENTAL_API to use it.\n"
+	       , __func__);
+#endif
+	return 0;
+#endif
+#if defined(RTE_ARCH_X86_64) && defined(ALLOW_EXPERIMENTAL_API)
+	size_t head, next, tail;
+	uint32_t free_entries;
+	unsigned int i;
+
+	n = __rte_ring_move_prod_head_64(r, 0, n, behavior,
+					 &head, &next, &free_entries);
+	if (n == 0)
+		goto end;
+
+	for (i = 0; i < n; /* i incremented if enqueue succeeds */) {
+		struct nb_ring_entry old_value, new_value;
+		struct nb_ring_entry *ring_ptr;
+
+		/* Enqueue to the tail entry. If another thread wins the
race,
+		 * retry with the new tail.
+		 */
+		tail = r->prod_64.tail;
+
+		ring_ptr = &((struct nb_ring_entry *)&r[1])[tail & r-
quoted
mask];
This is an ugly expression and cast. Also I think it is unnecessary.
What's preventing this from being written without a cast? Perhaps
the ring array needs to be a union of "void *" and struct
nb_ring_entry?
The cast is necessary for the correct pointer arithmetic (let
"uintptr_t base == &r[1]"):
Yes I know the C language.
quoted
- With cast: ring_ptr = base + sizeof(struct nb_ring_entry) * (tail &
r-
quoted
mask);
- W/o cast: ring_ptr = base + sizeof(struct rte_ring) * (tail &
r->mask);

FWIW, this is essentially the same as is done with the second argument
(&r[1]) to ENQUEUE_PTRS and DEQUEUE_PTRS, but there it's split across
multiple lines of code. The equivalent here would be:

struct nb_ring_entry *ring_base = (struct nb_ring_entry*)&r[1];
ring_ptr = ring_base[tail & r->mask];

Which is more legible, I think.
The RTE ring buffer code is not very legible to start with.
quoted
There is no ring array structure in which to add a union; the ring
array is a contiguous chunk of memory that immediately follows after
the end of a struct rte_ring. We interpret the memory there according
to the ring entry data type (void * for regular rings and struct nb_ring_entry for
non-blocking rings).
My worry is that we are abusing the C language and creating a monster of
fragile C code that will be more and more difficult to understand and to
maintain. At some point you have to think the question "Are we doing the right
thing?".
I'm not aware of any fragility/maintainability issues in the ring code (though perhaps the maintainers have a different view!), and personally I find the code fairly legible. If you have a specific suggestion, I'll look into incorporating it.

Thanks,
Gage

</snip>
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help