RE: [PATCH] x86: Run checksumming in parallel accross multiple alu's
From: David Laight <hidden>
Date: 2013-10-30 14:06:08
...
> and then I also wanted to try using both xmm and ymm registers and doing
> 64bit adds with 32bit numbers across multiple xmm/ymm registers as that
> should parallelize nicely. David, you mentioned you've tried this, how did
> your experiment turn out and what was your method? I was planning on doing
> regular full size loads into one xmm/ymm register, then using pshufd/vshufd
> to move the data into two different registers, then summing into a fourth
> register, and possibly running two of those pipelines in parallel.

It was a long time ago, and IIRC the code was just SSE, so the register
length just wasn't going to give the required benefit.
I know I wrote the code, but I can't even remember whether I actually got
it working!
With the longer AVX words it might make enough difference.

Of course, this assumes that you have the fpu registers available.
If you have to do an fpu context switch it will be a lot slower.

About the same time I did manage to get an open coded copy loop to run as
fast as 'rep movs' - and without any unrolling or any prefetch instructions.

Thinking about AVX you should be able to do (without looking up the
actual mnemonics):
	load
	add 32bit chunks to sum
	compare sum with read value (equiv of carry)
	add/subtract compare result (0 or ~0) to a carry-sum register
That is 4 instructions for 256 bits, so you can aim for 4 clocks.
You'd need to check the cpu book to see if any of those can be scheduled
at the same time (if not dependent).
(And also whether there is any result delay - don't think so.)

I'd try running two copies of the above - probably skewed so that the
memory accesses are separated, do the memory read for the next iteration,
and use the 3rd instruction unit for loop control.

	David
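
For reference, the four-step sequence above (load, add, compare, accumulate
the compare result) can be sketched in portable C, with an 8-element array
standing in for one 256-bit register. This is only an illustration of the
carry-tracking idea, not tested kernel code: the function name sum32_lanes,
the LANES constant, and the final reduction to a 64-bit total are assumptions
made for the example, and how the unsigned compare maps onto real AVX
instructions is not covered here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LANES 8	/* assumption: 8 x 32 bits models one 256-bit register */

/*
 * Hypothetical helper: add up every 32-bit word in 'nblocks' blocks of
 * 32 bytes and return the 64-bit total. Each lane keeps a 32-bit sum and
 * a separate 32-bit carry count, as in the four-step sequence.
 */
uint64_t sum32_lanes(const unsigned char *buf, size_t nblocks)
{
	uint32_t sum[LANES] = { 0 };
	uint32_t carry[LANES] = { 0 };
	uint32_t val[LANES];
	uint64_t total = 0;
	size_t i;
	int l;

	for (i = 0; i < nblocks; i++) {
		/* step 1: load 256 bits */
		memcpy(val, buf + i * sizeof(val), sizeof(val));

		for (l = 0; l < LANES; l++) {
			uint32_t mask;

			/* step 2: add 32bit chunks to sum (wraps on overflow) */
			sum[l] += val[l];

			/* step 3: sum < value read means the add wrapped;
			 * the result is 0 or ~0, like a vector compare */
			mask = (sum[l] < val[l]) ? UINT32_MAX : 0;

			/* step 4: subtracting ~0 adds 1 to the carry-sum */
			carry[l] -= mask;
		}
	}

	/* Combine each lane's carry count and sum into one 64-bit value. */
	for (l = 0; l < LANES; l++)
		total += ((uint64_t)carry[l] << 32) | sum[l];

	return total;
}
```

Calling sum32_lanes(buf, len / 32) on a buffer whose length is a multiple of
32 bytes gives the same value as adding each 32-bit word into a 64-bit
accumulator, provided the total fits in 64 bits. Folding that total down to a
16-bit checksum, and handling any trailing bytes, would still have to be done
separately.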