Re: [PATCH v2 2/2] powerpc32: optimise csum_partial() loop
From: Segher Boessenkool <hidden>
Date: 2015-08-06 23:25:48
Also in:
lkml
On Thu, Aug 06, 2015 at 05:45:45PM -0500, Scott Wood wrote:
quoted
The original loop was already optimal, as the comment said.The comment says that bdnz has zero overhead. That doesn't mean the adde won't stall waiting for the load result.
adde is execution serialising on those cores; it *always* stalls, that is, it won't run until it is next to complete.
quoted
The new code adds extra instructions and a mispredicted branch.Outside the main loop.
Sure, I never said it was super-bad or anything.
quoted
You also might get less overlap between the loads and adde (I didn't check if there is any originally): those instructions are no longer interleaved. I think it is a stupid idea to optimise code for all 32-bit PowerPC CPUs based on solely what is best for a particularly simple, slow implementation; and that is what this patch is doing.The simple and slow implementation is the one that needs optimizations the most.
And, on the other hand, optimising for atypical (mostly) in-order single-issue chips without branch folding, hurts performance on other chips the most. Well, dual-issue in-order might be worse :-P
If this makes performance non-negligibly worse on other 32-bit chips, and is an important improvement on 8xx, then we can use an ifdef since 8xx already requires its own kernel build. I'd prefer to see a benchmark showing that it actually does make things worse on those chips, though.
And I'd like to see a benchmark that shows it *does not* hurt performance on most chips, and does improve things on 8xx, and by how much. But it isn't *me* who has to show that, it is not my patch. If these csum routines actually matter for performance that much, there really *should* be chip-specific implementations. Segher