Thread (47 messages) 47 messages, 9 authors, 2013-11-04

RE: [PATCH] x86: Run checksumming in parallel accross multiple alu's

From: David Laight <hidden>
Date: 2013-11-04 09:50:59
Also in: lkml

quoted
I think you need 3 instructions, move a 0, conditionally move a 1
then add. I suspect it won't be a win!
Or, with an appropriately unrolled loop, for each word:
	zero %eax, cmove a 1 to %al
	cmove a 1 to %ah
	shift %eax left, cmove a 1 to %al
	cmove a 1 to %ah, add %eax onto somewhere.
However the 2nd instruction stream would have to use a different
register (IIRC 8bit updates depend on the entire register).
I agree, that sounds interesting, but very cpu dependent.  Thanks for the
suggestion, Ben, but I think it would be better if we just did the prefetch here
and re-addressed this area when AVX (or addcx/addox) instructions were available
for testing on hardware.
I didn't look too closely at the original figures.
With a simple loop you need 4 instructions per iteration (load, adc, inc, branch).
How close to one iteration per clock do you get?
I thought x86 hardware prefetch would load the cache lines for sequential
accesses - so any prefetch instructions are rather pointless.
However reading the value in the previous loop iteration should help.

I've just realised that there is a problem with the loop termination
condition also needing the flags register:-(
I don't remember the 'loop' instruction ever being added to any of the
fast path instruction decodes - so it won't help.

So I suspect the best you'll get is an interleaved sequence of load and adc
with an lea and inc (both to adjust the index) and a bne back to the top.
(the lea wants to be in the middle somewhere).
That might manage 1 clock per word + 1 clock per loop iteration (if the inc
and bne can be 'fused').

	David
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help