Thread (11 messages) 11 messages, 6 authors, 2025-09-16

Re: [PATCH net-next V2] net/mlx5: Improve write-combining test reliability for ARM64 Grace CPUs

From: Patrisious Haddad <hidden>
Date: 2025-09-16 08:39:20
Also in: linux-arch, linux-patches, linux-rdma, linux-s390, lkml, llvm

On 9/16/2025 2:15 AM, Nathan Chancellor wrote:
External email: Use caution opening links or attachments


On Mon, Sep 15, 2025 at 07:48:10PM -0300, Jason Gunthorpe wrote:
quoted
On Mon, Sep 15, 2025 at 03:27:58PM -0700, Nathan Chancellor wrote:
quoted
On Mon, Sep 15, 2025 at 03:18:59PM -0700, Nathan Chancellor wrote:
quoted
On Mon, Sep 15, 2025 at 11:35:08AM +0300, Tariq Toukan wrote:
...
quoted
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index d77696f46eb5..06d0eb190816 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -176,3 +176,9 @@ mlx5_core-$(CONFIG_PCIE_TPH) += lib/st.o

  obj-$(CONFIG_MLX5_DPLL) += mlx5_dpll.o
  mlx5_dpll-y := dpll.o
+
+#
+# NEON WC specific for mlx5
+#
+mlx5_core-$(CONFIG_KERNEL_MODE_NEON) += lib/wc_neon_iowrite64_copy.o
+FLAGS_lib/wc_neon_iowrite64_copy.o += $(CC_FLAGS_FPU)
Does this work as is? I think this needs to be CFLAGS instead of FLAGS
but I did not test to verify.
Also, Documentation/core-api/floating-point.rst states that code should
also use CFLAGS_REMOVE_ for CC_FLAGS_NO_FPU as well as adding
CC_FLAGS_FPU.

   CFLAGS_REMOVE_lib/wc_neon_iowrite64_copy.o += $(CC_FLAGS_NO_FPU)
I wondered if you needed the seperate compilation unit at all since it
it all done with inline assembly.. Since the makefile seems to have a
typo, it suggests you don't need the compilation unit and it could
just be a little inline protected by CONFIG_KERNEL_MODE_NEON.
There is difference between what actually compiles and the effect of 
these flags on actual performance/assembly translation. To avoid finding 
that the hard way I prefer to stick to their documentation which does as 
Natan described below,

a separate compilation unit between begin and end and the correct flags 
- and eventually that was what I tested , I missed to re-test this post 
finishing my code review - thinking my changes were only cosmetic ...
quoted hunk ↗ jump to hunk
Hmmm, clang rejects the current patch

   drivers/net/ethernet/mellanox/mlx5/core/lib/wc_neon_iowrite64_copy.c:9:3: error: instruction requires: neon
       9 |         ("ld1 {v0.16b, v1.16b, v2.16b, v3.16b}, [%0]\n\t"
         |          ^
   <inline asm>:1:2: note: instantiated into assembly here
       1 |         ld1 {v0.16b, v1.16b, v2.16b, v3.16b}, [x19]
         |         ^
   drivers/net/ethernet/mellanox/mlx5/core/lib/wc_neon_iowrite64_copy.c:9:48: error: instruction requires: neon
       9 |         ("ld1 {v0.16b, v1.16b, v2.16b, v3.16b}, [%0]\n\t"
         |                                                       ^
   <inline asm>:2:2: note: instantiated into assembly here
       2 |         st1 {v0.16b, v1.16b, v2.16b, v3.16b}, [x20]
         |         ^

while GCC accepts it... It looks like GCC's -mgeneral-regs-only only
impacts the compiler using floating-point and SIMD registers after [1]
in GCC 6.x, whereas clang's restriction is on both the compiler and
assembler. Perhaps clang should be adjusted to match but its behavior
seems more desirable for the kernel to ensure floating-point code is
properly separated and called between kernel_fpu_{begin,end}(). This
error is resolved with the following diff.

[1]: https://gcc.gnu.org/cgit/gcc/commit/?id=7d9425d46b58e69667300331aa55ebddddcceaeb

Cheers,
Nathan
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 06d0eb190816..a85fc21419d8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -181,4 +181,5 @@ mlx5_dpll-y :=      dpll.o
  # NEON WC specific for mlx5
  #
  mlx5_core-$(CONFIG_KERNEL_MODE_NEON) += lib/wc_neon_iowrite64_copy.o
-FLAGS_lib/wc_neon_iowrite64_copy.o += $(CC_FLAGS_FPU)
+CFLAGS_lib/wc_neon_iowrite64_copy.o += $(CC_FLAGS_FPU)
+CFLAGS_REMOVE_lib/wc_neon_iowrite64_copy.o += $(CC_FLAGS_NO_FPU)
You are spot on, I checked my patchset and the actual tested code 
(performance wise) beyond compilation used the following code:

ifeq ($(ARCH),arm64)
         CFLAGS_lib/neon_iowrite64_copy.o += -ffreestanding
         CFLAGS_REMOVE_lib/neon_iowrite64_copy.o += -mgeneral-regs-only
endif

Which is actually equivalent to the diff you sent, Thanks for the 
heads-up will fix and resend.

Thanks, Patrisious.

Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help