Re: [PATCH 1/3] riscv: optimized memcpy
From: Matteo Croce <hidden>
Date: 2021-06-16 19:07:16
Also in: linux-riscv, lkml
On Wed, Jun 16, 2021 at 10:24 AM David Laight [off-list ref] wrote:
From: Matteo Croce
Sent: 16 June 2021 03:02
That's a good idea, but if you read the replies to Gary's original patch
https://lore.kernel.org/linux-riscv/20210216225555.4976-1-gary@garyguo.net/
Gary, Palmer and David would all rather have a C-based version. This is one attempt at providing that.

Yep, I prefer C as well :) But if you check commit 04091d6, the assembly version was introduced for KASAN. So if we are to change it back to C, please make sure KASAN is not broken.
Leaving out the first memcpy/memset of every test, which is always slower (maybe because of a cache miss?), the current implementation copies 260 Mb/s when the low-order bits match, and 114 Mb/s otherwise. Memset is stable at 278 Mb/s. Gary's implementation is much faster: it still copies 260 Mb/s when equally placed, and 230 Mb/s otherwise. Memset is the same as the current one.

Any idea what the attainable performance is for the CPU you are using? Since both memset and memcpy are running at much the same speed, I suspect it is all limited by the writes. 272MB/s is only 34M writes/sec. This seems horribly slow for a modern CPU. So is this actually limited by the cache writes to physical memory? You might want to do some tests (userspace is fine) where you check much smaller lengths that definitely sit within the data cache.
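The 34M figure follows from 272 MB/s divided by 8 bytes per 64-bit store. A minimal userspace sketch of the suggested small-length test could look like the following. The function name `copy_mb_per_sec` and the choice of `clock()` for timing are assumptions made for illustration; this is not the benchmark used in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper: copy len bytes iters times and return the
 * throughput in MB/s (10^6 bytes per second of CPU time).
 * Returns -1.0 if allocation fails, 0.0 if the run was too short to time.
 */
static double copy_mb_per_sec(size_t len, unsigned long iters)
{
	/* Volatile function pointer so the compiler cannot elide the calls. */
	void *(*volatile copy)(void *, const void *, size_t) = memcpy;
	unsigned char *src, *dst;
	clock_t start, end;
	double secs;
	unsigned long i;

	assert(len > 0 && iters > 0);

	src = malloc(len);
	dst = malloc(len);
	if (!src || !dst) {
		free(src);
		free(dst);
		return -1.0;
	}
	memset(src, 0x5a, len);

	/* Warm-up pass: the first copy pays for the cache misses. */
	copy(dst, src, len);

	start = clock();
	for (i = 0; i < iters; i++)
		copy(dst, src, len);
	end = clock();

	secs = (double)(end - start) / CLOCKS_PER_SEC;
	free(src);
	free(dst);
	return secs > 0 ? (double)len * (double)iters / secs / 1e6 : 0.0;
}
```

Calling it with a length well below the L1 data cache size (say 4096) and again with a length far above the last-level cache gives two numbers to compare; if they are close, the copy loop itself is the limit rather than memory.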
I get similar results in userspace. This tool writes to RAM with a variable data width:

    root@beaglev:~/src# ./unalign_check 1 0 1
    size:           1 Mb
    write size:     8 bit
    unalignment:    0 byte
    elapsed time:   0.01 sec
    throughput:     124.36 Mb/s

    # ./unalign_check 1 0 8
    size:           1 Mb
    write size:     64 bit
    unalignment:    0 byte
    elapsed time:   0.00 sec
    throughput:     252.12 Mb/s
It is also worth checking how much overhead there is for short copies - they are almost certainly more common than you might expect. This is one problem with excessive loop unrolling - the 'special cases' for the ends of the buffer start having a big effect on small copies.
I too believe that they are much more common than long ones. Indeed, I wish to reduce the MIN_THRESHOLD value from 64 to 32 or even 16, or make it dependent on the word size, e.g. sizeof(long) * 2. Suggestions?
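To illustrate the word-size-dependent option, here is a minimal userspace sketch, not the patch itself. The name `sketch_memcpy` is hypothetical, only the equally aligned case takes the word path (the mismatched-alignment case simply falls back to bytes here), and the fixed-size `memcpy()` calls stand in for single word loads and stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: threshold tied to the word size, one of the options floated above. */
#define MIN_THRESHOLD (sizeof(long) * 2)

/* Hypothetical name; buffers must not overlap. */
static void *sketch_memcpy(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	long word;

	assert(dst != NULL && src != NULL);

	if (n >= MIN_THRESHOLD &&
	    ((uintptr_t)d ^ (uintptr_t)s) % sizeof(long) == 0) {
		/* Head: at most sizeof(long) - 1 byte copies to align both pointers. */
		while ((uintptr_t)d % sizeof(long)) {
			*d++ = *s++;
			n--;
		}
		/* Body: n > sizeof(long) here, so this runs at least once. */
		for (; n >= sizeof(long); n -= sizeof(long)) {
			memcpy(&word, s, sizeof(long));
			memcpy(d, &word, sizeof(long));
			d += sizeof(long);
			s += sizeof(long);
		}
	}

	/* Short copies and leftover tail bytes: plain byte loop. */
	while (n--)
		*d++ = *s++;

	return dst;
}
```

With a threshold of two words, the head loop consumes at most sizeof(long) - 1 bytes, so the word loop is guaranteed at least one iteration; anything shorter skips the alignment setup entirely.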
For CPUs that support misaligned memory accesses, one 'trick'
for transfers longer than a 'word' is to do a (probably) misaligned
transfer of the last word of the buffer first followed by the
transfer of the rest of the buffer (overlapping a few bytes at the end).
This saves on conditionals and temporary values.
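A minimal sketch of that trick, assuming non-overlapping buffers and a length of at least one word. The helper name `copy_tail_first` is hypothetical, and the fixed-size `memcpy()` calls stand in for single, possibly misaligned, word loads and stores.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy n >= sizeof(long) bytes between
 * non-overlapping buffers without a byte-wise tail loop.
 */
static void copy_tail_first(void *dst, const void *src, size_t n)
{
	const size_t w = sizeof(long);
	unsigned char *d = dst;
	const unsigned char *s = src;
	long word;
	size_t i;

	assert(n >= w);

	/* Last word first: covers bytes [n - w, n), whatever the alignment. */
	memcpy(&word, s + n - w, w);
	memcpy(d + n - w, &word, w);

	/* Whole words from the start; the final one may overlap the tail. */
	for (i = 0; i + w < n; i += w) {
		memcpy(&word, s + i, w);
		memcpy(d + i, &word, w);
	}
}
```

The loop never writes past byte n because it only runs while i + w < n, and the bytes it leaves uncovered at the end are exactly those already written by the tail store.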
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Regards,
-- 
per aspera ad upstream