RE: [PATCH] lib/string: Bring optimized memcmp from glibc

From: David Laight <hidden>
Date: 2021-07-23 14:02:47
Also in: lkml

From: Linus Torvalds

Sent: 21 July 2021 19:46

On Wed, Jul 21, 2021 at 11:17 AM Nikolay Borisov [off-list ref] wrote:

quoted

I find it somewhat arbitrary that we choose to align the 2nd pointer and
not the first.

Yeah, that's a bit odd, but I don't think it matters.

The hope is obviously that they are mutually aligned, and in that case
it doesn't matter which one you aim to align.

quoted

So you are saying that the current memcmp could indeed use improvement
but you don't want it to be based on the glibc's code due to the ugly
misalignment handling?

Yeah. I suspect that this (very simple) patch gives you the same
performance improvement that the glibc code does.

NOTE! I'm not saying this patch is perfect. This one doesn't even
_try_ to do the mutual alignment, because it's really silly. But I'm
throwing this out here for discussion, because

 - it's really simple

 - I suspect it gets you 99% of the way there

 - the code generation is actually quite good with both gcc and clang.
This is gcc:

        memcmp:
                jmp     .L60
        .L52:
                movq    (%rsi), %rax
                cmpq    %rax, (%rdi)
                jne     .L53
                addq    $8, %rdi
                addq    $8, %rsi
                subq    $8, %rdx
        .L60:
                cmpq    $7, %rdx
                ja      .L52
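
For reference, a userspace sketch of the kind of C that could compile to a loop of this shape. This is not the patch itself: the name memcmp_words is hypothetical, memcpy stands in for an unaligned 8-byte load, and the byte-wise tail is an assumption about how the remainder would be handled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name; a sketch, not the code from the patch. */
static int memcmp_words(const void *cs, const void *ct, size_t count)
{
	const unsigned char *p1 = cs, *p2 = ct;

	/* Word loop: two pointer bumps and a length decrement per iteration. */
	while (count >= 8) {
		uint64_t w1, w2;

		/* memcpy is the portable way to spell an unaligned load. */
		memcpy(&w1, p1, sizeof(w1));
		memcpy(&w2, p2, sizeof(w2));
		if (w1 != w2)
			break;		/* let the byte loop find the difference */
		p1 += 8;
		p2 += 8;
		count -= 8;
	}

	/* Byte loop: handles the tail and the first mismatching word. */
	for (; count > 0; p1++, p2++, count--) {
		int res = *p1 - *p2;

		if (res != 0)
			return res;
	}
	return 0;
}
```

Only the sign of the result is meaningful, as with the standard memcmp.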

I wonder how fast that can be made to run.
I think the two conditional branches have to run in separate clocks.
So you may get all 5 arithmetic operations to run in the same 2 clocks.
But that may be pushing things on everything except the very latest cpu.
The memory reads aren't limiting at all, the cpu can do two per clock.
So even though (IIRC) misaligned ones cost an extra clock it doesn't matter.

That looks like a *dst++ = *src++ loop.
The array copy dst[i] = src[i]; i++ requires one less 'addq'
provided the cpu has 'register + register' addressing.
Not decrementing the length also saves an 'addq'.
So the loop:
	for (i = 0; i < length - 7; i += 8)
		dst[i] = src[i];  /* Hacked to be right in C */
probably only has one addq and one cmpq per iteration.
That is much more likely to run in the 2 clocks.
(If you can persuade gcc not to transform it!)
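
A sketch of the indexed form applied to the compare rather than a copy. The helper name first_diff_word_indexed is hypothetical; it only runs the word loop and reports where it stopped, leaving the tail bytes to the caller. The bound is written as i + 8 <= len so that a length below 8 cannot wrap an unsigned 'length - 7'. Whether the compiler keeps the single index register is not guaranteed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: returns the byte offset of the first 8-byte word
 * that differs, or len rounded down to a multiple of 8 if all whole
 * words match.
 */
static size_t first_diff_word_indexed(const void *a, const void *b, size_t len)
{
	const unsigned char *pa = a, *pb = b;
	size_t i;

	/* One index serves both buffers: a single add per iteration. */
	for (i = 0; i + 8 <= len; i += 8) {
		uint64_t wa, wb;

		memcpy(&wa, pa + i, sizeof(wa));	/* unaligned load */
		memcpy(&wb, pb + i, sizeof(wb));
		if (wa != wb)
			break;
	}
	return i;
}
```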

It may also be possible to remove the cmpq by arranging
that the flags from the addq contain the right condition.
That needs something like:
	dst += len; src += len; len = -len
	do
		dst[len] = src[len];
	while ((len += 8) < 0);
That probably isn't necessary for x86, but is likely to help sparc.
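
The same idea as a compare, sketched under assumptions: the helper name first_diff_word_negidx is hypothetical, and a while loop replaces the do/while so that lengths below 8 are safe. The index counts up from -len towards zero, so the loop test is just the sign of the result of the add.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: same return value convention as a forward word
 * loop (offset of the first differing 8-byte word, or len rounded down
 * to a multiple of 8).
 */
static size_t first_diff_word_negidx(const void *a, const void *b, size_t len)
{
	size_t words = len & ~(size_t)7;	/* whole 8-byte words only */
	const unsigned char *ea = (const unsigned char *)a + words;
	const unsigned char *eb = (const unsigned char *)b + words;
	ptrdiff_t off = -(ptrdiff_t)words;	/* negative index from the end */

	while (off < 0) {
		uint64_t wa, wb;

		memcpy(&wa, ea + off, sizeof(wa));	/* unaligned load */
		memcpy(&wb, eb + off, sizeof(wb));
		if (wa != wb)
			break;
		off += 8;	/* the flags from this add decide the loop */
	}
	return (size_t)((ptrdiff_t)words + off);
}
```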

For mips-like cpu (with 'compare and jump', only 'reg + constant'
addressing) you really want a loop like:
	dst_end = dst + length;
	do
		*dst++ = *src++;
	while (dst < dst_end);
This has two adds and a compare per iteration.
That might be a good compromise for aligned copies.
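
As a compare, the end-pointer form might look like the sketch below. The helper name first_diff_word_endptr is hypothetical, and the loop is a while rather than a do/while so that a length below 8 does nothing.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: returns the byte offset of the first 8-byte word
 * that differs, or len rounded down to a multiple of 8.
 */
static size_t first_diff_word_endptr(const void *a, const void *b, size_t len)
{
	const unsigned char *pa = a, *pb = b;
	const unsigned char *pa_end = pa + (len & ~(size_t)7);

	/* Two pointer adds and one pointer compare per iteration. */
	while (pa < pa_end) {
		uint64_t wa, wb;

		memcpy(&wa, pa, sizeof(wa));	/* unaligned load */
		memcpy(&wb, pb, sizeof(wb));
		if (wa != wb)
			break;
		pa += 8;
		pb += 8;
	}
	return (size_t)(pa - (const unsigned char *)a);
}
```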

I'm not at all sure it is ever worth aligning either pointer
if misaligned reads don't fault.
Most compares (of any size) will be aligned.
So you get the 'hit' of the test when it cannot help.
That almost certainly exceeds any benefit.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
