Thread (43 messages) 43 messages, 9 authors, 2021-11-01

Re: [dpdk-dev] [PATCH v2 1/2] lib/eal: add amd epyc2 memcpy routine to eal

From: Thomas Monjalon <hidden>
Date: 2021-10-21 19:50:13

21/10/2021 21:03, Song, Keesang:
From: Thomas Monjalon <redacted>
quoted
21/10/2021 20:12, Song, Keesang:
quoted
From: Ananyev, Konstantin <redacted>
quoted
21/10/2021 19:10, Song, Keesang:
quoted
19/10/2021 17:35, Stephen Hemminger:
quoted
From: Thomas Monjalon <redacted>
quoted
19/10/2021 12:47, Aman Kumar:
quoted
This patch provides rte_memcpy* calls optimized for AMD EPYC
platforms. Use config/x86/x86_amd_epyc_linux_gcc as cross-file
with meson to build dpdk for AMD EPYC platforms.
Please split in 2 patches: platform & memcpy.

What optimization is specific to EPYC?

I dislike the asm code below.
What is AMD specific inside?
Can it use compiler intrinsics as it is done elsewhere?
And why is this not done by Gcc?
I hope this can make some explanation to your question.
We(AMD Linux library support team) have implemented the custom
tailored memcpy solution which is a close match with DPDK use case
requirements like the below.
1)      Min 64B length data packet with cache aligned
Source and Destination.
2)      Non-Temporal load and temporal store for cache aligned
source for both RX and TX paths.
Could not implement the non-temporal store for TX_PATH,
as non-Temporal load/stores works only with 32B aligned addresses
for AVX2
3)      This solution works for all AVX2 supported AMD machines.

Internally we have completed the integrity testing and benchmarking
of the solution and found gains of 8.4% to 14.5% specifically on
Milan CPU(3rd Gen of EPYC Processor)
It still not clear to me why it has to be written in assembler.
Why similar stuff can't be written in C with instincts, as rest of
rte_memcpy.h does?
The current memcpy implementation in Glibc is based out of assembly
coding.
Although memcpy could have been implemented with intrinsic,
but since our AMD library developers are working on the Glibc
functions, they have provided a tailored implementation based
out of inline assembly coding.
Please convert it to C code, thanks.
I've already asked our AMD tools team, but they're saying
they are not really familiar with C code implementation.
We need your approval for now since we really need to get
this patch submitted to 21.11 LTS.
Not sure it is urgent given that v2 came after the planned -rc1 date,
after 6 weeks of silence.
About the approval, there are already 3 technical board members
(Konstantin, Stephen and me) objecting against this patch.
Not being familiar with C code when working on CPU optimization
in 2021 is a strange argument.

In general, I don't really understand why we should maintain memcpy
functions in DPDK instead of relying on libc optimizations.
Having big asm code to maintain and debug is not helping.

I think this case shows that AMD needs to become more familiar
with DPDK schedule and expectations.
I would encourage you to contribute more in the project,
so such misunderstanding won't happen in future.

Hope that's all understandable


PS: discussion is more readable with replies below

Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help