Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
From: Joe Perches <joe@perches.com>
Date: 2013-11-01 20:26:57
Also in:
lkml
On Fri, 2013-11-01 at 15:58 -0400, Neil Horman wrote:
> On Fri, Nov 01, 2013 at 12:45:29PM -0700, Joe Perches wrote:
> > On Fri, 2013-11-01 at 13:37 -0400, Neil Horman wrote:
> > > I think it would be better if we just did the prefetch here and re-addressed this area when AVX (or adcx/adox) instructions were available for testing on hardware.
> >
> > Could there be a difference if only a single software prefetch was done at the beginning of the transfer, before the while loop, and the hardware prefetchers did the rest?
>
> I wouldn't think so. If the hardware is going to do any prefetching based on memory access patterns, it will do so regardless of the leading prefetch, and that first prefetch isn't helpful because we still wind up stalling on the adds while it's completing.
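To make the two placements concrete, here is a rough user-space sketch in C. It is not the patch and not the kernel's x86 checksum code: it assumes the GCC/Clang __builtin_prefetch() builtin in place of the kernel's prefetch(), uses a simplified portable ones'-complement sum, and all names (sum_words, fold64, csum_prefetch_each, csum_prefetch_once, PREFETCH_AHEAD) are hypothetical. Both variants return the same checksum; they differ only in when cache lines are requested.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical prefetch distance, modelled on the "5*64" in the thread. */
#define PREFETCH_AHEAD (5 * 64)

/* Add up 32-bit words in a 64-bit accumulator so no carries are lost;
 * a trailing 1-3 byte fragment is zero-padded. */
static uint64_t sum_words(const unsigned char *p, size_t len)
{
	uint64_t sum = 0;
	uint32_t w;

	while (len >= 4) {
		memcpy(&w, p, 4);
		sum += w;
		p += 4;
		len -= 4;
	}
	if (len) {
		w = 0;
		memcpy(&w, p, len);
		sum += w;
	}
	return sum;
}

/* Fold the 64-bit accumulator down to a 16-bit ones'-complement sum. */
static uint16_t fold64(uint64_t sum)
{
	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffffu) + (sum >> 16);
	sum = (sum & 0xffffu) + (sum >> 16);
	return (uint16_t)sum;
}

/* Placement A: a software prefetch on every 64-byte block, a fixed
 * distance ahead; near the end it targets lines past the buffer. */
static uint16_t csum_prefetch_each(const unsigned char *buf, size_t len)
{
	uint64_t sum = 0;
	size_t i = 0;

	for (; i + 64 <= len; i += 64) {
		/* Address formed via uintptr_t so no out-of-bounds pointer
		 * arithmetic happens; a prefetch is only a hint and does
		 * not fault on an invalid address. */
		__builtin_prefetch((const void *)((uintptr_t)buf + i + PREFETCH_AHEAD));
		sum += sum_words(buf + i, 64);
	}
	sum += sum_words(buf + i, len - i);
	return fold64(sum);
}

/* Placement B: a single prefetch before the loop; any further
 * prefetching is left to the hardware prefetchers. */
static uint16_t csum_prefetch_once(const unsigned char *buf, size_t len)
{
	uint64_t sum = 0;
	size_t i = 0;

	__builtin_prefetch(buf);
	for (; i + 64 <= len; i += 64)
		sum += sum_words(buf + i, 64);
	sum += sum_words(buf + i, len - i);
	return fold64(sum);
}
```

Whether B loses anything against A is a timing question that only measurements on real hardware can answer; per the point above, the loop in B may still stall on its first loads while the leading prefetch completes.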
I imagine one benefit would be helping to prevent prefetching beyond the data actually required. Maybe some hardware also optimizes the prefetch stride better than a fixed 5*64.

I also wonder whether guarding the prefetch with a length test, something like

	if (count > some_length)
		prefetch(...);
	while (...)

helps small lengths by more than the extra test/jump costs.
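A sketch of the length-guarded variant, in the same hedged spirit: the threshold PREFETCH_MIN_LEN (standing in for some_length), its value of 512, and the other names are hypothetical, and __builtin_prefetch() stands in for the kernel's prefetch(). The prefetching loop is bounded so the prefetch target never lies past the end of the buffer, and the remaining blocks are summed without prefetching, so a short buffer pays only the one up-front compare and branch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tuning knobs, for illustration only. */
#define PREFETCH_AHEAD   (5 * 64)  /* prefetch distance, as in "5*64" */
#define PREFETCH_MIN_LEN 512       /* "some_length": skip prefetch below this */

/* Add up 32-bit words in a 64-bit accumulator; zero-pad a 1-3 byte tail. */
static uint64_t sum_words(const unsigned char *p, size_t len)
{
	uint64_t sum = 0;
	uint32_t w;

	while (len >= 4) {
		memcpy(&w, p, 4);
		sum += w;
		p += 4;
		len -= 4;
	}
	if (len) {
		w = 0;
		memcpy(&w, p, len);
		sum += w;
	}
	return sum;
}

/* Fold the 64-bit accumulator down to a 16-bit ones'-complement sum. */
static uint16_t fold64(uint64_t sum)
{
	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffffu) + (sum >> 16);
	sum = (sum & 0xffffu) + (sum >> 16);
	return (uint16_t)sum;
}

static uint16_t csum_guarded(const unsigned char *buf, size_t len)
{
	uint64_t sum = 0;
	size_t i = 0;

	/* One length test up front instead of a test per iteration. */
	if (len > PREFETCH_MIN_LEN) {
		/* The loop bound keeps the prefetch target inside the buffer,
		 * so nothing beyond the data required is requested. */
		for (; i + PREFETCH_AHEAD + 64 <= len; i += 64) {
			__builtin_prefetch(buf + i + PREFETCH_AHEAD);
			sum += sum_words(buf + i, 64);
		}
	}
	/* Remaining blocks, or all of a short buffer: no prefetch. */
	for (; i + 64 <= len; i += 64)
		sum += sum_words(buf + i, 64);
	sum += sum_words(buf + i, len - i);
	return fold64(sum);
}
```

Whether the saved prefetches on small lengths outweigh the test/jump is, again, something only measurement would settle.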