Re: [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for... | linuxppc-dev

Re: [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision

From: Cyril Bur <hidden>
Date: 2017-09-25 23:59:53

On Sun, 2017-09-24 at 05:18 +0800, Simon Guo wrote:

Hi Cyril,
On Sat, Sep 23, 2017 at 12:06:48AM +1000, Cyril Bur wrote:

On Thu, 2017-09-21 at 07:34 +0800, wei.guo.simon@gmail.com wrote:

From: Simon Guo <redacted>

This patch adds VMX primitives to do memcmp() when the compare size
exceeds 4K bytes.

Hi Simon,

Sorry I didn't see this sooner. I've actually been working on a kernel
version of glibc commit dec4a7105e (powerpc: Improve memcmp performance
for POWER8). Unfortunately I've been distracted and it still isn't done.

Thanks for syncing with me. Let's consolidate our efforts :)

I had a quick look at glibc commit dec4a7105e.
It looks like the aligned-case comparison with VSX is launched without an
rN size limitation, which means it pays a VSX register load penalty even
when the length is only 9 bytes.

This was written for userspace which doesn't have to explicitly enable
VMX in order to use it - we need to be smarter in the kernel.

It also does some optimization for the case where the src/dest addresses
don't have the same offset from an 8-byte alignment boundary. I need to
read it more closely.

I wonder if we can consolidate our efforts here. One thing I did come
across in my testing is that for a memcmp() that will fail early (I
haven't narrowed down the optimal number yet), the cost of enabling
VMX actually turns out to be a performance regression. As such, I've
added a small check of the first 64 bytes before enabling VMX to
ensure the penalty is worth taking.

Will there still be a penalty if the 65th byte differs?

I haven't benchmarked it exactly. My rationale for 64 bytes was that it
is the stride of the vectorised copy loop, so if we know we'll fail
before even completing one iteration of the vectorised loop, there isn't
any point using the vector regs.

Also, you should consider doing 4K and greater: KSM (Kernel Samepage
Merging) uses PAGE_SIZE, which can be as small as 4K.

Currently the VMX will only be applied when size exceeds 4K. Are you
suggesting a bigger threshold than 4K?

Equal to or greater than 4K, so that KSM will benefit.
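
To make the two ideas in this exchange concrete, here is a minimal userspace sketch in C, not the patch itself (which is written in powerpc assembly). It assumes a threshold of "equal to or greater than 4K" and a scalar pre-check of the first 64 bytes before taking the vector path. The names memcmp_precheck(), compare_vector(), PRECHECK_BYTES, VMX_THRESHOLD and vector_path_calls are hypothetical, and the vector loop is stubbed out with plain memcmp().

```c
#include <stddef.h>
#include <string.h>

/* Assumed values, taken from the discussion rather than from the patch. */
#define PRECHECK_BYTES 64    /* stride of one vector-loop iteration */
#define VMX_THRESHOLD  4096  /* "equal to or greater than 4K" */

/* Hypothetical counter so a caller can observe which path was taken. */
static unsigned long vector_path_calls;

/*
 * Stand-in for the VMX compare loop. A real kernel version would have to
 * enable VMX before this point and disable it afterwards; that cost is
 * the penalty the pre-check tries to avoid.
 */
static int compare_vector(const void *a, const void *b, size_t n)
{
    vector_path_calls++;
    return memcmp(a, b, n);
}

static int memcmp_precheck(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a;
    const unsigned char *pb = b;
    int r;

    /* Short compares never pay the cost of enabling the vector unit. */
    if (n < VMX_THRESHOLD)
        return memcmp(a, b, n);

    /* Scalar check of the first 64 bytes: bail out on an early mismatch. */
    r = memcmp(pa, pb, PRECHECK_BYTES);
    if (r)
        return r;

    /* Only now is the vector path considered worth its setup cost. */
    return compare_vector(pa + PRECHECK_BYTES, pb + PRECHECK_BYTES,
                          n - PRECHECK_BYTES);
}
```

With this shape, a 4K compare (the KSM case) that differs within the first 64 bytes returns from the scalar pre-check, while one that differs at the 65th byte or later still enters the vector path. Whether that second case is a net win is exactly the open benchmarking question above.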

We can sync more offline for v3.

Thanks,
- Simon
