Re: [PATCH 1/3] riscv: optimized memcpy

From: Akira Tsukamoto <hidden>
Date: 2021-06-16 10:48:39
Also in: linux-riscv, lkml

On Wed, Jun 16, 2021 at 5:24 PM David Laight [off-list ref] wrote:

From: Matteo Croce

Sent: 16 June 2021 03:02

...

That's a good idea, but if you read the replies to Gary's original
patch
https://lore.kernel.org/linux-riscv/20210216225555.4976-1-gary@garyguo.net/ (local)
.. both Gary, Palmer and David would rather like a C-based version.
This is one attempt at providing that.

Yep, I prefer C as well :)

But if you check commit 04091d6, the assembly version was introduced
for KASAN. So if we are to change it back to C, please make sure KASAN
is not broken.

...

Leaving out the first memcpy/set of every test which is always slower, (maybe
because of a cache miss?), the current implementation copies 260 Mb/s when
the low order bits match, and 114 otherwise.
Memset is stable at 278 Mb/s.

Gary's implementation is much faster, copies still 260 Mb/s when equally placed,
and 230 Mb/s otherwise. Memset is the same as the current one.

Any idea what the attainable performance is for the cpu you are using?
Since both memset and memcpy are running at much the same speed
I suspect it is all limited by the writes.

272MB/s is only 34M 8-byte writes/sec.
This seems horribly slow for a modern cpu.
So is this actually really limited by the cache writes to physical memory?

You might want to do some tests (userspace is fine) where you
check much smaller lengths that definitely sit within the data cache.

It is also worth checking how much overhead there is for
short copies - they are almost certainly more common than
you might expect.
This is one problem with excessive loop unrolling - the 'special
cases' for the ends of the buffer start having a big effect
on small copies.

For cpus that support misaligned memory accesses, one 'trick'
for transfers longer than a 'word' is to do a (probably) misaligned
transfer of the last word of the buffer first followed by the
transfer of the rest of the buffer (overlapping a few bytes at the end).
This saves on conditionals and temporary values.
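
A minimal userspace sketch of that trick, for illustration only. The helper
name copy_overlapping_tail is hypothetical and is not part of any patch in
this thread. It assumes non-overlapping buffers and a length of at least one
word, and it uses fixed-size memcpy() calls to express the possibly misaligned
word accesses portably; compilers typically lower these to a single load and
store on targets that allow misaligned access.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): copy n bytes between
 * non-overlapping buffers, where n is at least one word.
 */
static void copy_overlapping_tail(void *dst, const void *src, size_t n)
{
	const size_t w = sizeof(unsigned long);
	unsigned char *d = dst;
	const unsigned char *s = src;
	unsigned long tail, word;
	size_t i;

	assert(n >= w);

	/* Copy the last word first; this access is probably misaligned. */
	memcpy(&tail, s + n - w, w);
	memcpy(d + n - w, &tail, w);

	/*
	 * Copy whole words from the start. The last iteration may overlap
	 * bytes already written by the tail copy, so no byte-wise tail
	 * loop is needed.
	 */
	for (i = 0; i + w < n; i += w) {
		memcpy(&word, s + i, w);
		memcpy(d + i, &word, w);
	}
}
```

The only length check left is the loop condition; the overlap at the end
replaces the usual "remaining bytes" special case.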

I am fine with Matteo's memcpy.

The two culprits in the `perf top -Ue task-clock` output during TCP and
UDP network traffic are

Overhead  Shared O  Symbol
 42.22%  [kernel]  [k] memcpy
 35.00%  [kernel]  [k] __asm_copy_to_user

so we really need to optimize both memcpy and __asm_copy_to_user.

The main reason for the speedup in memcpy is this:

Gary's assembly version of memcpy avoids unaligned accesses across the
64-bit boundary. It reads with aligned accesses and then shifts the data by
the offset, because every misaligned access is trapped and switches to
OpenSBI in M-mode. The main speedup comes from avoiding the switching between
S-mode (kernel) and M-mode (OpenSBI).

This is the relevant code:

Gary's:
+       /* Calculate shifts */
+       slli    t3, a3, 3
+       sub    t4, x0, t3 /* negate is okay as shift will only look at LSBs */
+
+       /* Load the initial value and align a1 */
+       andi    a1, a1, ~(SZREG-1)
+       REG_L    a5, 0(a1)
+
+       addi    t0, t0, -(SZREG-1)
+       /* At least one iteration will be executed here, no check */
+1:
+       srl    a4, a5, t3
+       REG_L    a5, SZREG(a1)
+       addi    a1, a1, SZREG
+       sll    a2, a5, t4
+       or    a2, a2, a4
+       REG_S    a2, 0(a0)
+       addi    a0, a0, SZREG
+       bltu    a0, t0, 1b

and Matteo ported it to C:

+#pragma GCC unroll 8
+        for (next = s.ulong[0]; count >= bytes_long + mask; count -= bytes_long) {
+            last = next;
+            next = s.ulong[1];
+
+            d.ulong[0] = last >> (distance * 8) |
+                     next << ((bytes_long - distance) * 8);
+
+            d.ulong++;
+            s.ulong++;
+        }
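
For illustration, the same shift-and-merge idea can be written as a
self-contained userspace sketch. The helper name copy_shift_merge and its
signature are hypothetical and are not taken from either patch. It assumes a
little-endian machine (as on RISC-V), a word-aligned destination, a source
that is 1 to sizeof(unsigned long) - 1 bytes past a word boundary, and that
the caller guarantees the aligned words covering the source are readable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): fill nwords aligned words at
 * dst from a misaligned src, using only aligned loads. Little-endian only.
 * Reads nwords + 1 aligned words starting at the word that contains src.
 */
static void copy_shift_merge(unsigned long *dst, const unsigned char *src,
			     size_t nwords)
{
	const size_t bytes_long = sizeof(unsigned long);
	const size_t distance = (uintptr_t)src % bytes_long;
	const unsigned long *s = (const unsigned long *)(src - distance);
	unsigned long last, next;
	size_t i;

	/* distance == 0 would shift by the full word width, which is undefined. */
	assert(distance != 0);
	assert((uintptr_t)dst % bytes_long == 0);

	next = s[0];
	for (i = 0; i < nwords; i++) {
		last = next;
		next = s[i + 1];
		/* High bytes of the previous word, low bytes of the next one. */
		dst[i] = last >> (distance * 8) |
			 next << ((bytes_long - distance) * 8);
	}
}
```

Each source word is loaded once and reused for two output words, so the loop
issues one aligned load and one aligned store per word and never takes the
misaligned-access trap.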

I believe this is reasonable and good enough to go upstream.

Akira

        David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
