Thread (11 messages), 3 authors, 2018-06-06

Re: [PATCH v7 0/5] powerpc/64: memcmp() optimization

From: Naveen N. Rao <hidden>
Date: 2018-06-06 06:36:22

Simon Guo wrote:
Hi Michael,
On Tue, Jun 05, 2018 at 12:16:22PM +1000, Michael Ellerman wrote:
Hi Simon,

wei.guo.simon@gmail.com writes:
From: Simon Guo <redacted>

There is some room to optimize memcmp() in the powerpc 64-bit version for
the following 2 cases:
(1) Even if the src/dst addresses are not aligned to 8 bytes at the beginning,
memcmp() can align them and go with the .Llong comparison mode without
falling back to the .Lshort comparison mode to compare the buffers byte by byte.
(2) VMX instructions can be used to speed up large-size comparisons;
currently the threshold is set to 4K bytes. Note that the VMX instructions
will incur a VMX register save/load penalty. This patch set includes a
patch to add a 32-byte pre-check to minimize the penalty.

It does something similar to glibc commit dec4a7105e (powerpc: Improve memcmp
performance for POWER8). Thanks to Cyril Bur for the information.
This patch set also updates the memcmp selftest case so that it compiles and
incorporates a large-size comparison case.

I'm seeing a few crashes with this applied, I haven't had time to look
into what is happening yet, sorry.

The bug is that memcmp() invokes a C function, enter_vmx_ops(), which loads
some PIC values based on r2.

memcmp() doesn't use r2, and if memcmp() is invoked from the kernel
itself, everything is fine. But if memcmp() is invoked from a module
[test_user_copy], r2 must be set up correctly. Otherwise enter_vmx_ops()
will refer to an incorrect/nonexistent data location based on the wrong r2 value.

The following patch fixes this issue:
------------
diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
index 5eba49744a5a..24d093fa89bb 100644
--- a/arch/powerpc/lib/memcmp_64.S
+++ b/arch/powerpc/lib/memcmp_64.S
@@ -102,7 +102,7 @@
  * 2) src/dst has different offset to the 8 bytes boundary. The handlers
  * are named like .Ldiffoffset_xxxx
  */
-_GLOBAL(memcmp)
+_GLOBAL_TOC(memcmp)
        cmpdi   cr1,r5,0

        /* Use the short loop if the src/dst addresses are not
----------

It means the memcmp() function entry will have 2 additional instructions. Is
there any way to skip these 2 instructions when memcmp() is actually invoked
from the kernel itself?

That will be the case. We will end up entering the function via the
local entry point, skipping the first two instructions. The global entry
point is only used for cross-module calls.

- Naveen
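
For reference, a rough sketch of the two entry points discussed above,
assuming an ELFv2 build. This is approximately what _GLOBAL_TOC(memcmp)
expands to; the exact macro text in arch/powerpc/include/asm/ppc_asm.h may
differ, so treat the listing as illustrative rather than authoritative.

------------
memcmp:                                 # global entry point; r12 = address of memcmp
0:      addis   r2,r12,(.TOC.-0b)@ha    # rebuild r2 (TOC pointer) from r12, high part
        addi    r2,r2,(.TOC.-0b)@l      # low part; r2 now points at this object's TOC
        .localentry memcmp,.-memcmp     # local entry point; r2 is assumed valid already
        cmpdi   cr1,r5,0                # first instruction of the memcmp() body
------------

A caller that shares the kernel's TOC branches straight to the local entry
point, so the addis/addi pair is never executed on that path. A call from a
module is expected to arrive at the global entry point with r12 holding the
function address, which is what lets the two instructions recompute r2 before
enter_vmx_ops() is reached.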
