Thread (23 messages) 23 messages, 6 authors, 2022-01-31

Re: [tip:x86/core 1/1] arch/x86/um/../lib/csum-partial_64.c:98:12: error: implicit declaration of function 'load_unaligned_zeropad'

From: Eric Dumazet <edumazet@google.com>
Date: 2021-11-26 18:34:51
Also in: lkml, oe-kbuild-all

On Fri, Nov 26, 2021 at 9:18 AM David Laight [off-list ref] wrote:
From: Eric Dumazet
quoted
Sent: 25 November 2021 04:01
...
quoted
quoted
The outputs seem to match if `buff` is aligned to 64-bit. Still see
difference with `csum_fold(csum_partial())` if `buff` is not 64-bit aligned.

The comment at the top says it's "best" to have `buff` 64-bit aligned but
the code logic seems meant to support the misaligned case so not
sure if it's an issue.
It is an issue in general, not in standard cases because network
headers are aligned.

I think it came when I folded csum_partial() and do_csum(), I forgot
to ror() the seed.

I suspect the following would help:
diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index 1eb8f2d11f7c785be624eba315fe9ca7989fd56d..ee7b0e7a6055bcbef42d22f7e1d8f52ddbd6be6d
100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -41,6 +41,7 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
        if (unlikely(odd)) {
                if (unlikely(len == 0))
                        return sum;
+               temp64 = ror32((__force u64)sum, 8);
                temp64 += (*(unsigned char *)buff << 8);
                len--;
                buff++;
You can save an instruction (as if this path matters) by:
                        temp64 = sum + *(unsigned char *)buff;
This might overflow, sum is 32bit.

Given that we have temp64 = (__force u64)sum;  already done at this
point, we can instead

temp64 += *(u8 *)buff;
                        temp64 <<= 8;

Although that probably falls foul of 64bit shifts being slow.
Are they slower than the ror32(sum, 8) ?
So maybe just:
                        sum += *(unsigned char *)buff;
we might miss a carry/overflow here
                        temp64 = bswap32(sum);
AFAICT (from a pdf) bswap32() and ror(x, 8) are likely to be
the same speed but may use different execution units.

Intel seem so have managed to slow down ror(x, %cl) to 3 clocks
in sandy bridge - and still not fixed it.
Although the compiler might be making a pigs-breakfast of the
register allocation when you tried setting 'odd = 8'.

Weeks can be spent fiddling with this code :-(
Yes, and in the end, it won't be able to compete with  a
specialized/inlined ipv6_csum_partial()

_______________________________________________
linux-um mailing list
linux-um@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-um
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help