Re: chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto:... | netdev

Re: chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API]

From: Toke Høiland-Jørgensen <toke@toke.dk>
Date: 2019-09-26 11:48:40
Also in: linux-arm-kernel, linux-crypto

"Jason A. Donenfeld" [off-list ref] writes:

[CC +willy, toke, dave, netdev]

Hi Pascal

On Thu, Sep 26, 2019 at 12:19 PM Pascal Van Leeuwen
[off-list ref] wrote:

quoted

Actually, that assumption is factually wrong. I don't know if anything
is *publicly* available, but I can assure you the silicon is running in
labs already. And something will be publicly available early next year
at the latest. Which could nicely coincide with having Wireguard support
in the kernel (which I would also like to see happen BTW) ...

Not "at some point". It will. Very soon. Maybe not in consumer or server
CPUs, but definitely in the embedded (networking) space.
And it *will* be much faster than the embedded CPU next to it, so it will
be worth using it for something like bulk packet encryption.

Super! I was wondering if you could speak a bit more about the
interface. My biggest questions surround latency. Will it be
synchronous or asynchronous? If the latter, why? What will its
latencies be? How deep will its buffers be? The reason I ask is that a
lot of crypto acceleration hardware of the past has been fast and
having very deep buffers, but at great expense of latency. In the
networking context, keeping latency low is pretty important. Already
WireGuard is multi-threaded which isn't super great all the time for
latency (improvements are a work in progress). If you're involved with
the design of the hardware, perhaps this is something you can help
ensure winds up working well? For example, AES-NI is straightforward
and good, but Intel can do that because they are the CPU. It sounds
like your silicon will be adjacent. How do you envision this working
in a low latency environment?

Being asynchronous doesn't *necessarily* have to hurt latency; you just
need the right queue back-pressure.


We already have multiple queues in the stack. With an async crypto
engine we would go from something like:

stack -> [qdisc] -> wg if -> [wireguard buffer] -> netdev driver ->
device -> [device buffer] -> wire

to

stack -> [qdisc] -> wg if -> [wireguard buffer] -> crypto stack ->
crypto device -> [crypto device buffer] -> wg post-crypto -> netdev
driver -> device -> [device buffer] -> wire

(where everything in [] is a packet queue).

The wireguard buffer is the source of the latency you're alluding to
above (the comment about multi-threaded behaviour), so we probably need
to fix that anyway. For the device buffer we have BQL to keep it at a
minimum. So that leaves the buffering in the crypto offload device. If
we add something like BQL to the crypto offload drivers, we could
conceivably avoid having that add a significant amount of latency. In
fact, doing so may benefit other users of crypto offloads as well, no?
Presumably ipsec has this same issue?


Caveat: I am fairly ignorant about the inner workings of the crypto
subsystem, so please excuse any inaccuracies in the above; the diagrams
are solely for illustrative purposes... :)

-Toke

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help