RE: [PATCH] riscv: use the generic string routines

From: David Laight <hidden>
Date: 2021-08-05 08:20:24
Also in: linux-riscv, lkml

From: Palmer Dabbelt

Sent: 04 August 2021 21:40

On Tue, 03 Aug 2021 09:54:34 PDT (-0700), mcroce@linux.microsoft.com wrote:

On Mon, Jul 19, 2021 at 1:44 PM Matteo Croce [off-list ref] wrote:

From: Matteo Croce <redacted>

Use the generic routines which handle alignment properly.

These are the performances measured on a BeagleV machine for a
32 mbyte buffer:

memcpy:
original aligned:        75 Mb/s
original unaligned:      75 Mb/s
new aligned:            114 Mb/s
new unaligned:          107 Mb/s

memset:
original aligned:       140 Mb/s
original unaligned:     140 Mb/s
new aligned:            241 Mb/s
new unaligned:          241 Mb/s

TCP throughput with iperf3 gives a similar improvement as well.

This is the binary size increase according to bloat-o-meter:

add/remove: 0/0 grow/shrink: 4/2 up/down: 432/-36 (396)
Function                                     old     new   delta
memcpy                                        36     324    +288
memset                                        32     148    +116
strlcpy                                      116     132     +16
strscpy_pad                                   84      96     +12
strlcat                                      176     164     -12
memmove                                       76      52     -24
Total: Before=1225371, After=1225767, chg +0.03%

Signed-off-by: Matteo Croce <redacted>
Signed-off-by: Emil Renner Berthing <kernel@esmil.dk>
---

Hi,

can someone have a look at this change and share opinions?

This LGTM.  How are the generic string routines landing?  I'm happy to
take this into my for-next, but IIUC we need the optimized generic
versions first so we don't have a performance regression falling back to
the trivial ones for a bit.  Is there a shared tag I can pull in?

I thought the actual problem was that the asm copy functions were
doing misaligned transfers and faulting.
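
To illustrate what "misaligned transfers" means here, a simplified C sketch follows. It is neither the riscv asm nor the generic routine from the patch: `copy_sketch`, `word_t` and `WORD_MASK` are hypothetical names, and `may_alias` is a GCC/Clang extension. It uses word-sized loads and stores only when source and destination share the same offset within a word, and otherwise falls back to bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* GCC/Clang extension: a word type that may alias any object. */
typedef unsigned long __attribute__((may_alias)) word_t;

#define WORD_MASK (sizeof(word_t) - 1)

/* Hypothetical helper, for illustration only. */
static void *copy_sketch(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	/* Word accesses are only safe if both pointers can be aligned together. */
	if ((((uintptr_t)d ^ (uintptr_t)s) & WORD_MASK) == 0) {
		/* Copy leading bytes until both pointers are word aligned. */
		while (n && ((uintptr_t)d & WORD_MASK)) {
			*d++ = *s++;
			n--;
		}
		/* Aligned word-at-a-time copy. */
		while (n >= sizeof(word_t)) {
			*(word_t *)d = *(const word_t *)s;
			d += sizeof(word_t);
			s += sizeof(word_t);
			n -= sizeof(word_t);
		}
	}

	/* Tail bytes, or the whole buffer when the offsets differ. */
	for (; n; n--)
		*d++ = *s++;

	return dst;
}
```

A real implementation would avoid the byte-only fallback for mutually misaligned buffers, for example by combining shifted aligned loads.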

There is no way that the simple C loop should be as fast as
the asm function given the delay cycles reading from memory.

You definitely need to test much smaller copies where the
buffers are resident in the L1 data cache.
Anything else is completely dominated by the cache line fills/spills.
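
A minimal userspace sketch of such a test is below. It is only an assumption about how one might measure this: it times the C library's `memcpy`, not the kernel routines, and `bench_copy`, `mb_per_sec`, `copy_fn`, the `COPY_BENCH_MAIN` guard, the sizes and the iteration count are all made up for illustration. The buffers are small and get a warm-up copy, so the timed loop should run mostly from the L1 data cache.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Call memcpy through a volatile pointer so the compiler cannot
 * inline or drop the copy inside the timing loop. */
static void *(*volatile copy_fn)(void *, const void *, size_t) = memcpy;

/* Convert bytes and seconds to MB/s (1 MB = 10^6 bytes here). */
static double mb_per_sec(double bytes, double secs)
{
	return secs > 0.0 ? bytes / secs / 1e6 : 0.0;
}

/* Time `iters` copies of `len` bytes; `src_off` misaligns the source.
 * Returns MB/s, or 0.0 if the elapsed time was too short to measure. */
static double bench_copy(size_t len, size_t src_off, unsigned long iters)
{
	unsigned char *src = malloc(len + src_off + 1);
	unsigned char *dst = malloc(len + 1);
	clock_t t0, t1;
	unsigned long i;

	assert(src && dst);
	memset(src, 0xa5, len + src_off + 1);
	memset(dst, 0, len + 1);

	/* Warm-up copy so both buffers are cache resident before timing. */
	copy_fn(dst, src + src_off, len);

	t0 = clock();
	for (i = 0; i < iters; i++)
		copy_fn(dst, src + src_off, len);
	t1 = clock();

	assert(memcmp(dst, src + src_off, len) == 0);
	free(src);
	free(dst);

	return mb_per_sec((double)len * (double)iters,
			  (double)(t1 - t0) / CLOCKS_PER_SEC);
}

#ifdef COPY_BENCH_MAIN
int main(void)
{
	/* Assumed sizes: small enough to fit a typical L1 data cache. */
	static const size_t sizes[] = { 16, 64, 256, 1024, 4096 };
	size_t i;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		printf("%5zu bytes: aligned %.0f MB/s, unaligned %.0f MB/s\n",
		       sizes[i],
		       bench_copy(sizes[i], 0, 1000000UL),
		       bench_copy(sizes[i], 1, 1000000UL));
	return 0;
}
#endif
```

Building with `-DCOPY_BENCH_MAIN` gives a standalone program. Comparing the kernel's own routines would need the same loop in a kernel module or test harness.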

You also need to test on the much faster RISC-V implementations,
not just on the BeagleV board.

	David

