Re: [PATCH 1/1] eal: add 128-bit cmpset (x86-64 only)

From: Ola Liljedahl <hidden>
Date: 2019-02-01 19:01:39

On Fri, 2019-02-01 at 17:06 +0000, Eads, Gage wrote:

quoted

-----Original Message-----
From: Ola Liljedahl [mailto:Ola.Liljedahl@arm.com]
Sent: Monday, January 28, 2019 5:02 PM
To: Eads, Gage <redacted>; dev@dpdk.org
Cc: arybchenko@solarflare.com; jerinj@marvell.com;
chaozhu@linux.vnet.ibm.com; nd [off-list ref]; Richardson, Bruce
[off-list ref]; Ananyev, Konstantin
[off-list ref]; hemant.agrawal@nxp.com;
olivier.matz@6wind.com; Honnappa Nagarahalli
[off-list ref]; Gavin Hu (Arm Technology China)
[off-list ref]
Subject: Re: [dpdk-dev] [PATCH 1/1] eal: add 128-bit cmpset (x86-64 only)

On Mon, 2019-01-28 at 11:29 -0600, Gage Eads wrote:

quoted

This operation can be used for non-blocking algorithms, such as a
non-blocking stack or ring.

Signed-off-by: Gage Eads <redacted>
---
 .../common/include/arch/x86/rte_atomic_64.h        | 31 +++++++++++
 lib/librte_eal/common/include/generic/rte_atomic.h | 65 ++++++++++++++++++++++
 2 files changed, 96 insertions(+)

diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
index fd2ec9c53..b7b90b83e 100644

--- a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h

@@ -34,6 +34,7 @@

 /*
  * Inspired from FreeBSD src/sys/amd64/include/atomic.h
  * Copyright (c) 1998 Doug Rabson
+ * Copyright (c) 2019 Intel Corporation
  * All rights reserved.
  */

@@ -46,6 +47,7 @@

 #include <stdint.h>
 #include <rte_common.h>
+#include <rte_compat.h>
 #include <rte_atomic.h>

 /*------------------------- 64 bit atomic operations -------------------------*/
@@ -208,4 +210,33 @@ static inline void rte_atomic64_clear(rte_atomic64_t *v)
 }
 #endif

+static inline int __rte_experimental

__rte_always_inline?

quoted

+rte_atomic128_cmpset(volatile rte_int128_t *dst,

No need to declare the location volatile. Volatile doesn't do what you think
it does.
https://youtu.be/lkgszkPnV8g?t=1027

I made this volatile to match the existing rte_atomicN_cmpset definitions,
which presumably have a good reason for using the keyword. Maintainers, any
input here?

quoted

+		     rte_int128_t *exp,

I would declare 'exp' const as well and document that 'exp' is not updated
(with the old value) on failure. The reason is that ARMv8.0/AArch64 cannot
atomically read the old value without also writing the location, and that is
bad for performance (unnecessary writes lead to unnecessary contention and
worse scalability). The user must anyway read the location (at the start of
the critical section) using e.g. non-atomic 64-bit reads, so there isn't
actually any requirement for an atomic 128-bit read of the location.

Will change in v2.
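
For readers following the thread, here is a minimal sketch of the calling
pattern described above: 'exp' is const and is not refreshed on failure, so
the caller re-reads the location with two 64-bit loads before each attempt.
Everything prefixed demo_ is hypothetical and not part of the patch. The
compare-and-set body is a lock-based stand-in for the hardware instruction
(cmpxchg16b or LDXP/STXP), used only so the sketch stays portable.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical 2x64-bit type, mirroring the layout discussed in the thread. */
typedef struct {
	uint64_t val[2];
} __attribute__((aligned(16))) demo_int128_t;

static_assert(sizeof(demo_int128_t) == 16, "expected a 16-byte type");

/* Stand-in lock; real code would use cmpxchg16b or LDXP/STXP instead. */
static atomic_flag demo_lock = ATOMIC_FLAG_INIT;

/*
 * Hypothetical stand-in for a 128-bit compare-and-set.
 * 'exp' is const: on failure the caller's copy is left untouched.
 * Returns 1 on success, 0 on failure.
 */
static int
demo_cmpset128(demo_int128_t *dst, const demo_int128_t *exp,
	       const demo_int128_t *src)
{
	int ok;

	while (atomic_flag_test_and_set_explicit(&demo_lock,
						 memory_order_acquire))
		; /* spin until the stand-in lock is free */
	ok = __atomic_load_n(&dst->val[0], __ATOMIC_RELAXED) == exp->val[0] &&
	     __atomic_load_n(&dst->val[1], __ATOMIC_RELAXED) == exp->val[1];
	if (ok) {
		__atomic_store_n(&dst->val[0], src->val[0], __ATOMIC_RELAXED);
		__atomic_store_n(&dst->val[1], src->val[1], __ATOMIC_RELAXED);
	}
	atomic_flag_clear_explicit(&demo_lock, memory_order_release);
	return ok;
}

/* Add 'delta' to val[0] and bump a version counter in val[1]. */
static void
demo_add_versioned(demo_int128_t *loc, uint64_t delta)
{
	demo_int128_t old, new_val;

	do {
		/*
		 * Two separate 64-bit loads. A torn pair that does not match
		 * the location makes the compare fail and the loop retries.
		 */
		old.val[0] = __atomic_load_n(&loc->val[0], __ATOMIC_RELAXED);
		old.val[1] = __atomic_load_n(&loc->val[1], __ATOMIC_RELAXED);
		new_val.val[0] = old.val[0] + delta;
		new_val.val[1] = old.val[1] + 1;
	} while (!demo_cmpset128(loc, &old, &new_val));
}
```

Because the loop reloads both halves itself, nothing is lost when a failed
compare-and-set leaves 'old' unchanged.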

quoted

 rte_int128_t *src,

const rte_int128_t *src?

Sure, I don't see any harm in using const.

quoted


But why are we not passing 'exp' and 'src' by value? That works great, even
with structs. Passing by value simplifies the compiler's life, especially if
the call is inlined. Ask a compiler developer.

I ran objdump on the nb_stack code with both approaches, and pass-by-reference 
resulted in fewer overall x86_64 assembly ops.
PBV: 100 ops for push, 97 ops for pop
PBR: 92 ops for push, 84 ops for pop

OK I have never checked x86_64 code generation... I have good experiences with
ARM/AArch64, everything seems to be done using registers. I am surprised there
is a difference.

Did a quick check with lfring, passing 'src' (third param) by reference and by
value. No difference in code generation on x86_64.

But if you insist let's go with PBR.

(Using the in-progress v5 nb_stack code)

Another factor -- though much less compelling -- is that with pass-by-
reference, the user can create a 16B structure and cast it to rte_int128_t
when they call rte_atomic128_cmpset, whereas with pass-by-value they need to
put that struct in a union with rte_int128_t.

Which is what I always do nowadays... Trying to use as few casts as possible and
lie to the compiler as seldom as possible. But I can see the freedom provided by
taking a pointer to something and casting it to an rte_int128_t pointer in the
call to rte_atomic128_cmpset().
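
As an illustration of the union approach mentioned here, the sketch below
overlays a 16-byte user structure on a 2x64-bit type so that no pointer cast
is needed. The demo_ names and the stack-head layout (pointer plus counter)
are assumptions made for the example, and 64-bit pointers are assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the 128-bit type under review. */
typedef struct {
	uint64_t val[2];
} __attribute__((aligned(16))) demo_int128_t;

/* Hypothetical user structure: list head plus modification counter. */
struct demo_head {
	void *top;
	uint64_t cnt;
};

/* Assumes 64-bit pointers, as on x86-64 and AArch64. */
static_assert(sizeof(struct demo_head) == 16, "head must be 16 bytes");

/* The union gives both views of the same 16 bytes without any cast. */
union demo_head_u {
	struct demo_head h;
	demo_int128_t raw;
};

/* Build the raw 128-bit view from the typed fields. */
static inline demo_int128_t
demo_head_to_raw(void *top, uint64_t cnt)
{
	union demo_head_u u;

	u.h.top = top;
	u.h.cnt = cnt;
	return u.raw; /* reading another union member is allowed in C */
}
```

With pass-by-reference, a caller could instead cast a pointer to a suitably
aligned struct demo_head, at the cost of hiding the type from the compiler.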

Would prefer a name that is more similar to __atomic_compare_exchange(). E.g.
rte_atomic128_compare_exchange() (or perhaps just rte_atomic128_cmpxchg)? All
the rte_atomicXX_cmpset() functions do not take any memory order parameters.
From an Arm perspective, we are not happy with that.
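
To make the naming point concrete, here is a minimal 64-bit sketch of a
wrapper shaped like the GCC/Clang __atomic_compare_exchange_n() builtin,
with explicit success and failure memory orders. The wrapper and helper
names are hypothetical and not existing DPDK functions. Note that the
builtin writes the observed value back into *exp on failure, which is a
different convention from the const 'exp' discussed earlier for the 128-bit
case.

```c
#include <stdint.h>

/*
 * Hypothetical wrapper; name and parameter order are assumptions modelled
 * on the compiler builtin. Returns non-zero on success. On failure the
 * builtin stores the value it observed into *exp.
 */
static inline int
demo_atomic64_compare_exchange(uint64_t *dst, uint64_t *exp, uint64_t src,
			       int weak, int success, int failure)
{
	return __atomic_compare_exchange_n(dst, exp, src, weak,
					   success, failure);
}

/* Usage: increment with release ordering on success, relaxed on failure. */
static inline uint64_t
demo_increment(uint64_t *ctr)
{
	uint64_t old = __atomic_load_n(ctr, __ATOMIC_RELAXED);

	/* A failed attempt refreshes 'old', so the loop needs no reload. */
	while (!demo_atomic64_compare_exchange(ctr, &old, old + 1, 1,
					       __ATOMIC_RELEASE,
					       __ATOMIC_RELAXED))
		;
	return old + 1;
}
```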

quoted

+		     unsigned int weak,
+		     enum rte_atomic_memmodel_t success,
+		     enum rte_atomic_memmodel_t failure) {
+	RTE_SET_USED(weak);
+	RTE_SET_USED(success);
+	RTE_SET_USED(failure);
+	uint8_t res;
+
+	asm volatile (
+		      MPLOCKED
+		      "cmpxchg16b %[dst];"
+		      " sete %[res]"
+		      : [dst] "=m" (dst->val[0]),
+			"=A" (exp->val[0]),
+			[res] "=r" (res)
+		      : "c" (src->val[1]),
+			"b" (src->val[0]),
+			"m" (dst->val[0]),
+			"d" (exp->val[1]),
+			"a" (exp->val[0])
+		      : "memory");
+
+	return res;
+}
+
 #endif /* _RTE_ATOMIC_X86_64_H_ */

diff --git a/lib/librte_eal/common/include/generic/rte_atomic.h b/lib/librte_eal/common/include/generic/rte_atomic.h
index b99ba4688..8d612d566 100644

--- a/lib/librte_eal/common/include/generic/rte_atomic.h
+++ b/lib/librte_eal/common/include/generic/rte_atomic.h

@@ -14,6 +14,7 @@

 #include <stdint.h>
 #include <rte_common.h>
+#include <rte_compat.h>

 #ifdef __DOXYGEN__

@@ -1082,4 +1083,68 @@ static inline void rte_atomic64_clear(rte_atomic64_t *v)
 }
 #endif

+/*------------------------ 128 bit atomic operations -------------------------*/
+
+/**
+ * 128-bit integer structure.
+ */
+typedef struct {
+	uint64_t val[2];
+} __rte_aligned(16) rte_int128_t;

So we can't use __int128?

I'll put it in a union with val[2], in case any implementations want to
use it.

Thinking on this one more time, since the inline asm functions (e.g. for x86_64
cmpxchg16b and for AArch64 LDXP/STXP) anyway will use 64-bit registers, it makes
most sense to make rte_int128_t a struct of 2x64b. The question is whether to
use an array like above or a struct with two elements (which I normally do
internally). Can you compare code generation with the following definition?
typedef struct {
        uint64_t lo, hi;
} __rte_aligned(16) rte_int128_t;
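
One way to compare the options side by side is to overlay them in a single
union and let the compiler check the layout. This is only a sketch: the
demo_ names are hypothetical, the anonymous struct needs C11, and the
__int128 member is included only where the compiler provides that type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; not the definition from the patch. */
typedef union {
	uint64_t val[2];          /* array form from the patch */
	struct {
		uint64_t lo, hi;  /* two-field form suggested above */
	};
#ifdef __SIZEOF_INT128__
	__int128 int128;          /* native type, where the compiler has one */
#endif
} __attribute__((aligned(16))) demo_int128_t;

/* Compile-time checks that all views cover the same 16 aligned bytes. */
static_assert(sizeof(demo_int128_t) == 16, "must be 16 bytes");
static_assert(_Alignof(demo_int128_t) == 16, "must be 16-byte aligned");
static_assert(offsetof(demo_int128_t, hi) == 8, "hi must overlay val[1]");

/* Build a value through the lo/hi view. */
static inline demo_int128_t
demo_make128(uint64_t lo, uint64_t hi)
{
	demo_int128_t v;

	v.lo = lo;
	v.hi = hi;
	return v;
}
```

Either 64-bit view maps directly onto the register pairs used by the inline
asm, so the choice between val[2] and lo/hi is mostly about readability and
code generation, which is what the comparison requested above would show.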

Thanks,
Gage

[snip]

-- 
Ola Liljedahl, Networking System Architect, Arm
Phone +46706866373, Skype ola.liljedahl
