[PATCH 7/9] powerpc32: optimise csum_partial() loop

From: Christophe Leroy <hidden>
Date: 2015-09-22 14:35:50
Also in: lkml, netdev
Subsystem: linux for powerpc (32-bit and 64-bit), the rest · Maintainers: Madhavan Srinivasan, Michael Ellerman, Linus Torvalds

On the 8xx, load latency is 2 cycles and a taken branch also costs
2 cycles, so unroll the loop to handle four words per iteration.

This patch improves csum_partial() speed by around 10% on both:
* 8xx (single-issue processor with parallel execution)
* 83xx (superscalar 6xx processor with dual instruction fetch
  and parallel execution)

Signed-off-by: Christophe Leroy <redacted>
---
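For readers less familiar with PowerPC assembly, the control flow added
by the hunk below can be sketched in C. This is only an illustrative
model under stated assumptions, not the kernel code: add_carry(),
sum_words_simple() and sum_words_unrolled() are hypothetical names, and
the end-around-carry helper merely stands in for the adde carry chain.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 32-bit add with end-around carry (models adde). */
static uint32_t add_carry(uint32_t sum, uint32_t word)
{
	uint64_t t = (uint64_t)sum + word;

	return (uint32_t)t + (uint32_t)(t >> 32);
}

/* Reference: one word per iteration, like the loop at label 2. */
static uint32_t sum_words_simple(const uint32_t *p, size_t nwords,
				 uint32_t sum)
{
	while (nwords--)
		sum = add_carry(sum, *p++);
	return sum;
}

/* Sketch of the patched flow: leftover words first, then blocks of 4. */
static uint32_t sum_words_unrolled(const uint32_t *p, size_t nwords,
				   uint32_t sum)
{
	size_t rem = nwords & 3;	/* andi. r6,r6,3 */
	size_t blocks = nwords >> 2;	/* srwi. r6,r4,4 on the byte length */

	while (rem--)			/* loop at label 2 */
		sum = add_carry(sum, *p++);

	while (blocks--) {		/* loop at label 22 */
		/* Issue the four loads before the dependent adds. */
		uint32_t a = p[0], b = p[1], c = p[2], d = p[3];

		p += 4;
		sum = add_carry(sum, a);
		sum = add_carry(sum, b);
		sum = add_carry(sum, c);
		sum = add_carry(sum, d);
	}
	return sum;
}
```

Both functions add the words in the same order, so they return the same
value for any input; the unrolled form only takes one loop branch per
four words and groups the loads ahead of the adds that consume them.
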
 arch/powerpc/lib/checksum_32.S | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/lib/checksum_32.S b/arch/powerpc/lib/checksum_32.S
index 9c12602..0d34f47 100644
--- a/arch/powerpc/lib/checksum_32.S
+++ b/arch/powerpc/lib/checksum_32.S

@@ -38,10 +38,24 @@ _GLOBAL(csum_partial)
 	srwi.	r6,r4,2		/* # words to do */
 	adde	r5,r5,r0
 	beq	3f
-1:	mtctr	r6
+1:	andi.	r6,r6,3		/* Prepare to handle words 4 by 4 */
+	beq	21f
+	mtctr	r6
 2:	lwzu	r0,4(r3)
 	adde	r5,r5,r0
 	bdnz	2b
+21:	srwi.	r6,r4,4		/* # blocks of 4 words to do */
+	beq	3f
+	mtctr	r6
+22:	lwz	r0,4(r3)
+	lwz	r6,8(r3)
+	lwz	r7,12(r3)
+	lwzu	r8,16(r3)
+	adde	r5,r5,r0
+	adde	r5,r5,r6
+	adde	r5,r5,r7
+	adde	r5,r5,r8
+	bdnz	22b
 3:	andi.	r0,r4,2
 	beq+	4f
 	lhz	r0,4(r3)

-- 
2.1.0
