Thread: 47 messages, 9 authors, 2013-11-04

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

From: Joe Perches <joe@perches.com>
Date: 2013-11-01 20:26:57
Also in: lkml

On Fri, 2013-11-01 at 15:58 -0400, Neil Horman wrote:
> On Fri, Nov 01, 2013 at 12:45:29PM -0700, Joe Perches wrote:
> > On Fri, 2013-11-01 at 13:37 -0400, Neil Horman wrote:
> > > I think it would be better if we just did the prefetch here
> > > and re-addressed this area when AVX (or adcx/adox) instructions were available
> > > for testing on hardware.
> > Could there be a difference if only a single software
> > prefetch was done at the beginning of transfer before
> > the while loop and hardware prefetches did the rest?
> I wouldn't think so.  If hardware was going to do any prefetching based on
> memory access patterns it will do so regardless of the leading prefetch, and
> that first prefetch isn't helpful because we still wind up stalling on the adds
> while it's completing.

I imagine one benefit would be helping to prevent
prefetching beyond the data actually required.

Maybe some hardware optimizes prefetch stride
better than 5*64.

I wonder also if using

	if (count > some_length)
		prefetch
	while (...)

helps small lengths more than the test/jump cost.
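
To make the shape of that question concrete, here is a rough
userspace sketch, not the kernel's do_csum() and not the patch
under discussion. The names csum_sketch and PREFETCH_MIN_LEN are
made up for illustration, the threshold value is an arbitrary
guess, and __builtin_prefetch (GCC/Clang) stands in for the
kernel's prefetch(). It issues one software prefetch before the
loop, and only when the buffer is longer than the threshold, so
short buffers pay for a single test/jump and nothing else.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cutoff; a real value would have to come from measurement. */
#define PREFETCH_MIN_LEN 256

/*
 * Simplified ones'-complement style sum over 16-bit native-endian words,
 * folded to 16 bits. Illustrative only.
 */
static uint16_t csum_sketch(const unsigned char *buf, size_t count)
{
	uint64_t sum = 0;

	/* Single leading prefetch, skipped entirely for short buffers. */
	if (count > PREFETCH_MIN_LEN)
		__builtin_prefetch(buf);

	while (count >= 2) {
		uint16_t w;

		memcpy(&w, buf, sizeof(w));	/* avoids unaligned loads */
		sum += w;
		buf += 2;
		count -= 2;
	}
	if (count)
		sum += *buf;			/* trailing odd byte, added as-is */

	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)sum;
}
```

Whether the branch pays for itself would need timing the same
loop with and without the test across a range of lengths.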