Re: [tip:x86/core 1/1] arch/x86/um/../lib/csum-partial_64.c:98:12: error:... | linux-um

(off-list ancestor, not in this archive)

On Fri, Nov 26, 2021 at 9:18 AM David Laight [off-list ref] wrote:
From: Eric Dumazet
Sent: 25 November 2021 04:01
...
The outputs seem to match if `buff` is aligned to 64-bit. Still see
difference with `csum_fold(csum_partial())` if `buff` is not 64-bit aligned.

The comment at the top says it's "best" to have `buff` 64-bit aligned but
the code logic seems meant to support the misaligned case so not
sure if it's an issue.
It is an issue in general, not in standard cases because network
headers are aligned.

I think it came when I folded csum_partial() and do_csum(), I forgot
to ror() the seed.

I suspect the following would help:

diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index 1eb8f2d11f7c785be624eba315fe9ca7989fd56d..ee7b0e7a6055bcbef42d22f7e1d8f52ddbd6be6d
100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -41,6 +41,7 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
        if (unlikely(odd)) {
                if (unlikely(len == 0))
                        return sum;
+               temp64 = ror32((__force u64)sum, 8);
                temp64 += (*(unsigned char *)buff << 8);
                len--;
                buff++;
You can save an instruction (as if this path matters) by:
                        temp64 = sum + *(unsigned char *)buff;
This might overflow, sum is 32bit.

Given that we have temp64 = (__force u64)sum;  already done at this
point, we can instead

temp64 += *(u8 *)buff;

                        temp64 <<= 8;


Although that probably falls foul of 64bit shifts being slow.
Are they slower than the ror32(sum, 8) ?

So maybe just:
                        sum += *(unsigned char *)buff;
we might miss a carry/overflow here

                        temp64 = bswap32(sum);

AFAICT (from a pdf) bswap32() and ror(x, 8) are likely to be
the same speed but may use different execution units.

Intel seem so have managed to slow down ror(x, %cl) to 3 clocks
in sandy bridge - and still not fixed it.
Although the compiler might be making a pigs-breakfast of the
register allocation when you tried setting 'odd = 8'.

Weeks can be spent fiddling with this code :-(
Yes, and in the end, it won't be able to compete with  a
specialized/inlined ipv6_csum_partial()

_______________________________________________
linux-um mailing list
linux-um@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-um


`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help