Thread (11 messages) 11 messages, 3 authors, 2015-08-17

Re: [PATCH v2 2/2] powerpc32: optimise csum_partial() loop

From: Segher Boessenkool <hidden>
Date: 2015-08-06 23:25:48
Also in: lkml

On Thu, Aug 06, 2015 at 05:45:45PM -0500, Scott Wood wrote:
quoted
The original loop was already optimal, as the comment said.
The comment says that bdnz has zero overhead.  That doesn't mean the adde 
won't stall waiting for the load result.
adde is execution serialising on those cores; it *always* stalls,
that is, it won't run until it is next to complete.
quoted
The new code adds extra instructions and a mispredicted branch.
Outside the main loop.
Sure, I never said it was super-bad or anything.
quoted
  You also might get less overlap between the loads and adde (I didn't check
if there is any originally): those instructions are no longer
interleaved.

I think it is a stupid idea to optimise code for all 32-bit PowerPC
CPUs based on solely what is best for a particularly simple, slow
implementation; and that is what this patch is doing.
The simple and slow implementation is the one that needs optimizations the 
most.
And, on the other hand, optimising for atypical (mostly) in-order
single-issue chips without branch folding, hurts performance on
other chips the most.  Well, dual-issue in-order might be worse :-P
If this makes performance non-negligibly worse on other 32-bit chips, and is 
an important improvement on 8xx, then we can use an ifdef since 8xx already 
requires its own kernel build.  I'd prefer to see a benchmark showing that it 
actually does make things worse on those chips, though.
And I'd like to see a benchmark that shows it *does not* hurt performance
on most chips, and does improve things on 8xx, and by how much.  But it
isn't *me* who has to show that, it is not my patch.

If these csum routines actually matter for performance that much, there
really *should* be chip-specific implementations.


Segher
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help