RE: [PATCH] x86: Run checksumming in parallel accross multiple alu's
From: David Laight <hidden>
Date: 2013-11-04 09:50:59
Also in:
lkml
> I think you need 3 instructions, move a 0, conditionally move a 1
> then add. I suspect it won't be a win!
>
> Or, with an appropriately unrolled loop, for each word:
>     zero %eax, cmove a 1 to %al,
>     cmove a 1 to %ah,
>     shift %eax left, cmove a 1 to %al,
>     cmove a 1 to %ah, add %eax onto somewhere.
> However the 2nd instruction stream would have to use a different
> register (IIRC 8bit updates depend on the entire register).
>
> I agree, that sounds interesting, but very cpu dependent. Thanks for
> the suggestion, Ben, but I think it would be better if we just did the
> prefetch here and re-addressed this area when AVX (or addcx/addox)
> instructions were available for testing on hardware.
I didn't look too closely at the original figures. With a simple loop you need 4 instructions per iteration (load, adc, inc, branch). How close to one iteration per clock do you get?

I thought x86 hardware prefetch would load the cache lines for sequential accesses, so any prefetch instructions are rather pointless. However, reading the value in the previous loop iteration should help.

I've just realised that there is a problem with the loop termination condition also needing the flags register :-( I don't remember the 'loop' instruction ever being added to any of the fast path instruction decodes, so it won't help.

So I suspect the best you'll get is an interleaved sequence of load and adc, with an lea and an inc (both to adjust the index) and a jne back to the top (the lea wants to be in the middle somewhere). That might manage 1 clock per word + 1 clock per loop iteration (if the inc and jne can be 'fused').

David
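
For anyone following along, here is a rough C sketch of the two ideas discussed above: recovering the carry with a compare instead of the flags register (roughly what the "move a 0, conditionally move a 1, then add" sequence does), and keeping two independent accumulators so the adds don't form a single dependency chain. The names add_carry, fold64 and csum_words are made up for illustration; this is not the kernel's csum_partial(), it assumes the buffer is a whole number of aligned 64-bit words, and whether a compiler turns it into adc, setc or cmov is up to the compiler.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: add with end-around carry, no flags register needed.
 * After the add, (sum < w) is 1 exactly when the 64-bit add wrapped. */
static uint64_t add_carry(uint64_t sum, uint64_t w)
{
    sum += w;
    return sum + (sum < w);   /* wrap the carry back in; cannot overflow again */
}

/* Fold a 64-bit ones' complement sum down to 16 bits. */
static uint16_t fold64(uint64_t sum)
{
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffffu) + (sum >> 16);
    sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}

/* Sum n 64-bit words using two independent accumulators (a and b), so the
 * two add chains can in principle execute in parallel. */
static uint16_t csum_words(const uint64_t *p, size_t n)
{
    uint64_t a = 0, b = 0;
    size_t i;

    for (i = 0; i + 1 < n; i += 2) {
        a = add_carry(a, p[i]);
        b = add_carry(b, p[i + 1]);
    }
    if (i < n)                      /* odd word count: one word left over */
        a = add_carry(a, p[i]);

    return fold64(add_carry(a, b)); /* combine the two streams, then fold */
}
```

The result is the folded (not inverted) ones' complement sum, so a caller building an IP-style checksum would still complement it. Timing this against the adc loop on real hardware is the only way to know whether it is a win.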