Re: [PATCH] powerpc: tiny memcpy_(to|from)io optimisation

[PATCH] powerpc: tiny memcpy_(to|from)io optimisation · Albrecht Dreß <hidden> · 2009-05-27
Re: [PATCH] powerpc: tiny memcpy_(to|from)io optimisation · Joakim Tjernlund <hidden> · 2009-05-28
Re: [PATCH] powerpc: tiny memcpy_(to|from)io optimisation · Albrecht Dreß <hidden> · 2009-05-28
Re: [PATCH] powerpc: tiny memcpy_(to|from)io optimisation · Joakim Tjernlund <hidden> · 2009-05-29
Re: [PATCH] powerpc: tiny memcpy_(to|from)io optimisation · Albrecht Dreß <hidden> · 2009-05-31
Re: [PATCH] powerpc: tiny memcpy_(to|from)io optimisation · Joakim Tjernlund <hidden> · 2009-06-01
Re: [PATCH] powerpc: tiny memcpy_(to|from)io optimisation · Albrecht Dreß <hidden> · 2009-06-02
Re: [PATCH] powerpc: tiny memcpy_(to|from)io optimisation · Benjamin Herrenschmidt <benh@kernel.crashing.org> · 2009-06-02
Re: [PATCH] powerpc: tiny memcpy_(to|from)io optimisation · Kenneth Johansson <hidden> · 2009-06-03
Re: [PATCH] powerpc: tiny memcpy_(to|from)io optimisation · Albrecht Dreß <hidden> · 2009-06-03
Re: [PATCH] powerpc: tiny memcpy_(to|from)io optimisation · Wolfram Sang <hidden> · 2009-06-11
Re: [PATCH] powerpc: tiny memcpy_(to|from)io optimisation · Grant Likely <hidden> · 2009-06-11
Re: [PATCH] powerpc: tiny memcpy_(to|from)io optimisation · Lorenz Kolb <hidden> · 2009-06-19

From: Kenneth Johansson <hidden>
Date: 2009-06-03 14:36:43

On Wed, 2009-06-03 at 08:51 +1000, Benjamin Herrenschmidt wrote:

On Tue, 2009-06-02 at 20:45 +0200, Albrecht Dreß wrote:

which drops the r1 accesses, but still produces the sub-optimal loop.   
Is this a gcc regression, or did I miss something here?  Probably the  
only bullet-proof way is to write some core loops in assembly... :-/

Well, gcc may be right here. What you call the "optimal" loop uses the
lwzu instruction. An interesting thing about this instruction is that
it updates two GPRs at completion (I'm ignoring the load multiple and
string instructions on purpose here).

I wouldn't be surprised thus if the loop variant with the separate add
ends up more efficient on most implementations around.

On an e300 core, using lwzu/stwu is about 20% faster, so at least one
core prefers that optimization.
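
For reference, the two loop shapes under discussion can be sketched in plain C roughly as follows. This is only an illustration and not the code from the patch: the function names are made up, the buffers are ordinary memory rather than I/O space (a real memcpy_toio/memcpy_fromio would go through the proper I/O accessors), and whether gcc turns the first form into lwzu/stwu or the second into a load/store plus a separate add depends on the compiler version and flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name. Pointer-walking form: both pointers advance on every
 * word. A PowerPC compiler may choose update-form instructions
 * (lwzu/stwu) for this shape, but nothing in C forces it to. */
static void copy_words_update(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    assert(dst != NULL && src != NULL);
    while (nwords--)
        *dst++ = *src++;
}

/* Hypothetical name. Indexed form: one counter, base pointers unchanged.
 * This shape tends towards plain loads/stores with a separate add. */
static void copy_words_indexed(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < nwords; i++)
        dst[i] = src[i];
}

/* Returns 1 if both variants copy the same test pattern correctly. */
static int copy_variants_agree(void)
{
    uint32_t src[8], a[8], b[8];

    for (size_t i = 0; i < 8; i++)
        src[i] = (uint32_t)(i * 0x01010101u);
    memset(a, 0, sizeof a);
    memset(b, 0, sizeof b);

    copy_words_update(a, src, 8);
    copy_words_indexed(b, src, 8);

    return memcmp(a, src, sizeof src) == 0 &&
           memcmp(b, src, sizeof src) == 0;
}
```

Both functions are functionally identical; the only way to tell which instructions a given toolchain picks is to look at the generated assembly (for example with gcc -O2 -S) and to time the result on the target core.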
