RE: [PATCH] x86: Run checksumming in parallel accross multiple alu's

From: David Laight <hidden>
Date: 2013-11-04 09:50:59
Also in: lkml

quoted

I think you need 3 instructions, move a 0, conditionally move a 1
then add. I suspect it won't be a win!

Or, with an appropriately unrolled loop, for each word:
	zero %eax, cmove a 1 to %al
	cmove a 1 to %ah
	shift %eax left, cmove a 1 to %al
	cmove a 1 to %ah, add %eax onto somewhere.
However the 2nd instruction stream would have to use a different
register (IIRC 8bit updates depend on the entire register).
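
For illustration only, the "materialise the carry and add it onto somewhere"
idea can be sketched in portable C. This is not code from the patch: the
names csum_two_streams and fold64 are made up for the example, the carry is
recovered with an unsigned compare rather than cmov/setc, and whether a
compiler turns it into anything fast is untested. Each stream keeps its own
sum and its own carry counter, so neither depends on a shared carry flag.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fold a 64-bit value to a 16-bit ones' complement sum. */
static uint16_t fold64(uint64_t x)
{
	x = (x & 0xffffffffu) + (x >> 32);	/* at most 33 bits */
	x = (x & 0xffffffffu) + (x >> 32);	/* at most 32 bits */
	x = (x & 0xffffu) + (x >> 16);		/* at most 17 bits */
	x = (x & 0xffffu) + (x >> 16);		/* at most 16 bits */
	return (uint16_t)x;
}

/* Hypothetical sketch: two independent streams, carries counted separately. */
uint16_t csum_two_streams(const uint64_t *w, size_t n)
{
	uint64_t sum_a = 0, sum_b = 0;
	uint64_t carries_a = 0, carries_b = 0;
	uint32_t total;
	size_t i;

	for (i = 0; i + 1 < n; i += 2) {
		sum_a += w[i];
		carries_a += (sum_a < w[i]);	/* 1 if the add wrapped */
		sum_b += w[i + 1];
		carries_b += (sum_b < w[i + 1]);
	}
	if (i < n) {				/* odd trailing word */
		sum_a += w[i];
		carries_a += (sum_a < w[i]);
	}

	/* Each lost carry is worth 2^64, which is 1 modulo 0xffff. */
	total = fold64(sum_a) + fold64(sum_b) +
		fold64(carries_a) + fold64(carries_b);
	return fold64(total);
}
```

The compare-and-add per word is the extra work discussed above, so this
shows the data flow rather than a likely win.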

I agree, that sounds interesting, but very CPU dependent.  Thanks for the
suggestion, Ben, but I think it would be better if we just did the prefetch here
and re-addressed this area when AVX (or adcx/adox) instructions were available
for testing on hardware.

I didn't look too closely at the original figures.
With a simple loop you need 4 instructions per iteration (load, adc, inc, branch).
How close to one iteration per clock do you get?
I thought x86 hardware prefetch would load the cache lines for sequential
accesses - so any prefetch instructions are rather pointless.
However, loading each value one loop iteration before it is needed should help.
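
As a rough sketch of the simple loop with the load moved one iteration
ahead (csum_simple is a name invented for this example, not the kernel
routine, and the end-around carry is written as a compare because plain C
has no adc):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch: 64-bit ones' complement sum, one word per
 * iteration.  The word for the next iteration is loaded before the
 * current one is added, so the load is not on the critical path of
 * the add chain.
 */
uint64_t csum_simple(const uint64_t *w, size_t n)
{
	uint64_t sum = 0, next;
	size_t i;

	if (n == 0)
		return 0;

	next = w[0];
	for (i = 1; i < n; i++) {
		uint64_t cur = next;

		next = w[i];		/* load for the following iteration */
		sum += cur;
		sum += (sum < cur);	/* end-around carry */
	}
	sum += next;
	sum += (sum < next);
	return sum;
}
```

An out-of-order core may well hoist the load by itself, so any gain from
writing it this way would need measuring.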

I've just realised that there is a problem: the loop termination
condition also needs the flags register :-(
I don't remember the 'loop' instruction ever being added to any of the
fast path instruction decodes - so it won't help.

So I suspect the best you'll get is an interleaved sequence of load and adc
with an lea and inc (both to adjust the index) and a jne back to the top.
(the lea wants to be in the middle somewhere).
That might manage 1 clock per word + 1 clock per loop iteration (if the inc
and jne can be 'fused').
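
The shape of such an unrolled loop, in C rather than assembler, might look
like the sketch below. The names add_eac and csum_unrolled4 and the unroll
factor of 4 are assumptions made for the example; the point is only that
the index update and branch are paid once per four words.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: add with end-around carry (ones' complement add). */
static inline uint64_t add_eac(uint64_t sum, uint64_t v)
{
	sum += v;
	return sum + (sum < v);
}

/* Hypothetical sketch: four loads and four adds per loop iteration. */
uint64_t csum_unrolled4(const uint64_t *w, size_t n)
{
	uint64_t sum = 0;
	size_t i = 0;

	for (; i + 4 <= n; i += 4) {
		/* loads first, so they can overlap with the add chain */
		uint64_t a = w[i], b = w[i + 1];
		uint64_t c = w[i + 2], d = w[i + 3];

		sum = add_eac(sum, a);
		sum = add_eac(sum, b);
		sum = add_eac(sum, c);
		sum = add_eac(sum, d);
	}
	for (; i < n; i++)		/* leftover words */
		sum = add_eac(sum, w[i]);
	return sum;
}
```

With an unroll of N the loop overhead is amortised to 1/N per word, which is
where the "1 clock per word + 1 clock per loop iteration" estimate comes from.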

	David
