RE: [PATCH v2 0/3] lib/string: optimized mem* functions
From: David Laight <hidden>
Date: 2021-07-12 09:04:23
Also in:
linux-riscv, lkml
From: Matteo Croce
> Sent: 11 July 2021 00:08
>
> On Sat, Jul 10, 2021 at 11:31 PM Andrew Morton [off-list ref] wrote:
> >
> > On Fri, 2 Jul 2021 14:31:50 +0200 Matteo Croce [off-list ref] wrote:
> >
> > > From: Matteo Croce <redacted>
> > >
> > > Rewrite the generic mem{cpy,move,set} so that memory is accessed with
> > > the widest size possible, but without doing unaligned accesses.
> > >
> > > This was originally posted as C string functions for RISC-V[1], but as
> > > there was no specific RISC-V code, it was proposed for the generic
> > > lib/string.c implementation.
> > >
> > > Tested on RISC-V and on x86_64 by undefining __HAVE_ARCH_MEM{CPY,SET,MOVE}
> > > and HAVE_EFFICIENT_UNALIGNED_ACCESS.
> > >
> > > These are the performances of memcpy() and memset() of a RISC-V machine
> > > on a 32 mbyte buffer:
> > >
> > > memcpy:
> > > original aligned:        75 Mb/s
> > > original unaligned:      75 Mb/s
> > > new aligned:            114 Mb/s
> > > new unaligned:          107 Mb/s
> > >
> > > memset:
> > > original aligned:       140 Mb/s
> > > original unaligned:     140 Mb/s
> > > new aligned:            241 Mb/s
> > > new unaligned:          241 Mb/s
> >
> > Did you record the x86_64 performance?
> >
> > Which other architectures are affected by this change?
>
> x86_64 won't use these functions because it defines __HAVE_ARCH_MEMCPY
> and has optimized implementations in arch/x86/lib.
> Anyway, I was curious and I tested them on x86_64 too, there was zero
> gain over the generic ones.

x86 performance (and attainable performance) does depend on the
cpu micro-architecture.

Any recent 'desktop' intel cpu will almost certainly manage to
re-order the execution of almost any copy loop and attain
1 write per clock.
(Even the trivial 'while (count--) *dest++ = *src++;' loop.)

The same isn't true of the Atom based cpu that may be on small servers.
These are no slouches (eg 4 cores at 2.4GHz) but only have limited
out-of-order execution and so are much more sensitive to
instruction ordering.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
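
For reference, a minimal sketch of the two copy styles being compared
in this thread. It is not the code from the patch. copy_bytes() is the
trivial loop quoted above. copy_words() is a hypothetical illustration
of "widest size possible, but without doing unaligned accesses": it
only takes the word path when source and destination share the same
alignment, and falls back to bytes otherwise (the patch series may well
handle the mutually misaligned case differently). The function names
and the may_alias typedef (a GCC/Clang extension) are assumptions made
for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Word type that may alias any object (GCC/Clang extension, assumed here). */
typedef unsigned long __attribute__((__may_alias__)) word_t;

/* The trivial byte-at-a-time loop. */
static void *copy_bytes(void *dest, const void *src, size_t count)
{
	unsigned char *d = dest;
	const unsigned char *s = src;

	while (count--)
		*d++ = *s++;
	return dest;
}

/*
 * Hypothetical word-at-a-time copy that never issues an unaligned
 * word access. Buffers must not overlap.
 */
static void *copy_words(void *dest, const void *src, size_t count)
{
	unsigned char *d = dest;
	const unsigned char *s = src;
	const uintptr_t mask = sizeof(word_t) - 1;

	/* Word accesses are only safe if both pointers can be aligned together. */
	if ((((uintptr_t)d ^ (uintptr_t)s) & mask) == 0) {
		/* Copy leading bytes until the destination is word aligned. */
		while (count && ((uintptr_t)d & mask)) {
			*d++ = *s++;
			count--;
		}
		/* Both pointers are now aligned: copy one word per iteration. */
		while (count >= sizeof(word_t)) {
			*(word_t *)d = *(const word_t *)s;
			d += sizeof(word_t);
			s += sizeof(word_t);
			count -= sizeof(word_t);
		}
	}

	/* Trailing bytes, or the whole buffer if alignments differ. */
	while (count--)
		*d++ = *s++;
	return dest;
}
```

On a cpu with deep out-of-order execution the two may well perform
about the same; the difference is more likely to show on in-order or
limited out-of-order cores.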