Thread: 47 messages, 9 authors, 2013-11-04

RE: [PATCH] x86: Run checksumming in parallel accross multiple alu's

From: David Laight <hidden>
Date: 2013-10-30 14:06:08
Also in: lkml

...
> and then I also wanted to try using both xmm and ymm registers and doing
> 64bit adds with 32bit numbers across multiple xmm/ymm registers as that
> should parallelize nicely.  David, you mentioned you've tried this, how did
> your experiment turn out and what was your method?  I was planning on
> doing regular full size loads into one xmm/ymm register, then using
> pshufd/vshufd to move the data into two different registers, then
> summing into a fourth register, and possibly running two of those
> pipelines in parallel.

It was a long time ago, and IIRC the code was just SSE so the
register length just wasn't going to give the required benefit.
I know I wrote the code, but I can't even remember whether I
actually got it working!
With the longer AVX words it might make enough difference.
Of course, this assumes that you have the FPU registers
available. If you have to do an FPU context switch it will
be a lot slower.

About the same time I did manage to get an open-coded copy
loop to run as fast as 'rep movs' - and without any unrolling
or any prefetch instructions.

Thinking about AVX you should be able to do (without looking up the
actual mnemonics):
	load
	add 32bit chunks to sum
	compare sum with read value (equiv of carry)
	add/subtract compare result (0 or ~0) to a carry-sum register
That is 4 instructions for 256 bits, so you can aim for 4 clocks.
You'd need to check the cpu book to see if any of those can
be scheduled at the same time (if not dependent).
(and also whether there is any result delay - don't think so.)
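
To make the four steps concrete, here is a rough sketch in portable C rather
than real AVX code. It models one 256-bit register as eight 32-bit lanes and
assumes the length is a multiple of 32 bytes (any tail is ignored). The names
lane_sum, fold16 and LANES are made up for illustration. On real hardware the
256-bit integer add and compare would need AVX2, and the packed 32-bit compare
there is signed, so an actual implementation would have to work around that.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LANES 8	/* eight 32-bit lanes model one 256-bit register */

/*
 * Sum native-endian 32-bit words lane by lane. A carry out of a lane is
 * detected by comparing the new sum with the value just added, and the
 * compare result (0 or ~0) is subtracted from a per-lane carry counter.
 * Bytes beyond the last whole 32-byte block are ignored in this sketch.
 */
static uint64_t lane_sum(const uint8_t *buf, size_t len)
{
	uint32_t sum[LANES] = {0};
	uint32_t carry[LANES] = {0};
	uint64_t total = 0;
	size_t i, l;

	for (i = 0; i + 4 * LANES <= len; i += 4 * LANES) {
		uint32_t v[LANES];

		memcpy(v, buf + i, sizeof(v));		/* load */
		for (l = 0; l < LANES; l++) {
			uint32_t mask;

			sum[l] += v[l];			/* add 32bit chunks */
			mask = (sum[l] < v[l]) ? ~0u : 0u; /* compare */
			carry[l] -= mask;		/* subtract 0 or ~0 */
		}
	}

	/* Combine each lane's carry count and sum into one 64-bit total. */
	for (l = 0; l < LANES; l++)
		total += ((uint64_t)carry[l] << 32) + sum[l];
	return total;
}

/* Fold a 64-bit sum to 16 bits with end-around carry. */
static uint16_t fold16(uint64_t x)
{
	x = (x & 0xffffffffu) + (x >> 32);
	x = (x & 0xffffffffu) + (x >> 32);
	x = (x & 0xffffu) + (x >> 16);
	x = (x & 0xffffu) + (x >> 16);
	return (uint16_t)x;
}
```

The inner loop over lanes stands in for one vector instruction per step, so
the per-block cost in a vector version would be the four operations listed
above.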

I'd try running two copies of the above - probably skewed so that
the memory accesses are separated, do the memory read for the
next iteration, and use the 3rd instruction unit for loop control.

	David