[PATCH net-next v6 19/23] zinc: Curve25519 ARM implementation

From: Jason@zx2c4.com (Jason A. Donenfeld)
Date: 2018-10-03 01:03:27
Also in: linux-crypto, lkml, netdev

(+Dan,Peter in CC. Replying to:
<https://lore.kernel.org/lkml/CAKv+Gu9FLDRLxHReKcveZYHNYerR5Y2pZd9gn-hWrU0jb2KgfA@mail.gmail.com/ (local)>
for context.)

Hi Ard,

On Tue, Oct 2, 2018 at 6:59 PM Ard Biesheuvel [off-list ref] wrote:

Shouldn't this use the new simd abstraction as well?

Yes, it probably should, thanks.

I guess qhasm means generated code, right?
Because many of these adds are completely redundant ...
This looks odd as well.
Could you elaborate on what qhasm is exactly? And, as with the other
patches, I would prefer it if we could have your changes as a separate
patch (although having the qhasm base would be preferred)

Indeed qhasm converts this --
<https://github.com/floodyberry/supercop/blob/master/crypto_scalarmult/curve25519/neon2/scalarmult.pq>
-- into this. It's a thing from Dan (CC'd now) --
<http://cr.yp.to/qhasm.html>. As you've requested, I can layer the
patches to show our changes on top.

... you can drop this add
same here
and here
and here
and here
and here
and here
and here
redundant add
I'll stop here - let me just note that this code does not strike me as
particularly well optimized for in-order cores (such as A7).
For instance, the sequence
can be reordered as
and not have every other instruction depend on the output of the previous one.
Obviously, the ultimate truth is in the benchmark numbers, but I'd
thought I'd mention it anyway.

Yes indeed the output is suboptimal in a lot of places. We can
gradually clean this up -- slowly and carefully over time -- if you
want. I can also look into producing a new implementation within HACL*
so that it's verified. Assurance-wise, though, I feel pretty good
about this implementation considering its origins, its breadth of use
(in BoringSSL), the fuzzing hours it's incurred, and the actual
implementation itself.

 Either way, performance-wise, it's really worth having.

For example, on a Cortex-A7, we get these results (according to get_cycles()):

neon: 23142 cycles per call
fiat32: 49136 cycles per call
donna32: 71988 cycles per call

And on a Cortex-A9, we get these results (according to get_cycles()):

neon: 5020 cycles per call
fiat32: 17326 cycles per call
donna32: 28076 cycles per call

Jason

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help