Re: [PATCH] powerpc: tiny memcpy_(to|from)io optimisation
From: Kenneth Johansson <hidden>
Date: 2009-06-03 14:36:43
On Wed, 2009-06-03 at 08:51 +1000, Benjamin Herrenschmidt wrote:
On Tue, 2009-06-02 at 20:45 +0200, Albrecht Dreß wrote:
which drops the r1 accesses, but still produces the sub-optimal loop. Is this a gcc regression, or did I miss something here? Probably the only bullet-proof way is to write some core loops in assembly... :-/
Well, gcc may be right here. What you call the "optimal" loop uses the lwzu instruction. An interesting thing about this instruction is that it updates two GPRs at completion (I'm ignoring the load multiple and string instructions on purpose here).
I wouldn't be surprised thus if the loop variant with the separate add ends up more efficient on most implementations around.
On an e300 core, using lwzu/stwu is about 20% faster, so at least one core prefers that optimization.
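For anyone who wants to compare the two loop shapes on their own core, here is a rough C sketch of the variants under discussion. The function names are made up for illustration and this is not the kernel's memcpy_toio/memcpy_fromio code. The first form bumps both pointers on every access, which is the shape a compiler may map to update-form loads and stores (lwzu/stwu); the second uses a separate index, which tends to give the "separate add" loop. Which instructions gcc actually emits depends on its version and flags, so check the disassembly and time both on the target.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pointer-bumping loop. Each iteration advances
 * both pointers, the pattern that can be folded into update-form
 * load/store instructions on PowerPC (not guaranteed). */
static void copy_words_bump(uint32_t *dst, const uint32_t *src, size_t n)
{
    while (n--)
        *dst++ = *src++;
}

/* Hypothetical helper: indexed loop. The base pointers stay fixed and
 * a single counter is incremented separately from the memory accesses. */
static void copy_words_indexed(uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        dst[i] = src[i];
}
```

Both functions copy n 32-bit words and give identical results; the only point of having the two is to see what the compiler generates for each and how they time on a given implementation.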