Re: [tip:x86/core 1/1] arch/x86/um/../lib/csum-partial_64.c:98:12: error: implicit declaration of function 'load_unaligned_zeropad'
From: Eric Dumazet <edumazet@google.com>
Date: 2021-11-26 18:34:51
Also in:
lkml, oe-kbuild-all
On Fri, Nov 26, 2021 at 9:18 AM David Laight [off-list ref] wrote:
From: Eric Dumazetquoted
Sent: 25 November 2021 04:01...quoted
quoted
The outputs seem to match if `buff` is aligned to 64-bit. Still see difference with `csum_fold(csum_partial())` if `buff` is not 64-bit aligned. The comment at the top says it's "best" to have `buff` 64-bit aligned but the code logic seems meant to support the misaligned case so not sure if it's an issue.It is an issue in general, not in standard cases because network headers are aligned. I think it came when I folded csum_partial() and do_csum(), I forgot to ror() the seed. I suspect the following would help:diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c index 1eb8f2d11f7c785be624eba315fe9ca7989fd56d..ee7b0e7a6055bcbef42d22f7e1d8f52ddbd6be6d100644--- a/arch/x86/lib/csum-partial_64.c +++ b/arch/x86/lib/csum-partial_64.c@@ -41,6 +41,7 @@ __wsum csum_partial(const void *buff, int len, __wsum sum) if (unlikely(odd)) { if (unlikely(len == 0)) return sum; + temp64 = ror32((__force u64)sum, 8); temp64 += (*(unsigned char *)buff << 8); len--; buff++;You can save an instruction (as if this path matters) by: temp64 = sum + *(unsigned char *)buff;
This might overflow, sum is 32bit. Given that we have temp64 = (__force u64)sum; already done at this point, we can instead temp64 += *(u8 *)buff;
temp64 <<= 8;
Although that probably falls foul of 64bit shifts being slow.
Are they slower than the ror32(sum, 8) ?
So maybe just:
sum += *(unsigned char *)buff;we might miss a carry/overflow here
temp64 = bswap32(sum);
AFAICT (from a pdf) bswap32() and ror(x, 8) are likely to be the same speed but may use different execution units. Intel seem so have managed to slow down ror(x, %cl) to 3 clocks in sandy bridge - and still not fixed it. Although the compiler might be making a pigs-breakfast of the register allocation when you tried setting 'odd = 8'. Weeks can be spent fiddling with this code :-(
Yes, and in the end, it won't be able to compete with a specialized/inlined ipv6_csum_partial() _______________________________________________ linux-um mailing list linux-um@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-um