Re: chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API]
From: Toke Høiland-Jørgensen <toke@toke.dk>
Date: 2019-09-26 11:48:40
Also in: linux-arm-kernel, linux-crypto
"Jason A. Donenfeld" [off-list ref] writes:
> [CC +willy, toke, dave, netdev]
>
> Hi Pascal
>
> On Thu, Sep 26, 2019 at 12:19 PM Pascal Van Leeuwen [off-list ref] wrote:
>> Actually, that assumption is factually wrong. I don't know if anything is *publicly* available, but I can assure you the silicon is running in labs already. And something will be publicly available early next year at the latest. Which could nicely coincide with having Wireguard support in the kernel (which I would also like to see happen BTW) ...
>>
>> Not "at some point". It will. Very soon. Maybe not in consumer or server CPUs, but definitely in the embedded (networking) space. And it *will* be much faster than the embedded CPU next to it, so it will be worth using it for something like bulk packet encryption.
>
> Super! I was wondering if you could speak a bit more about the interface. My biggest questions surround latency. Will it be synchronous or asynchronous? If the latter, why? What will its latencies be? How deep will its buffers be? The reason I ask is that a lot of crypto acceleration hardware of the past has been fast and having very deep buffers, but at great expense of latency. In the networking context, keeping latency low is pretty important. Already WireGuard is multi-threaded which isn't super great all the time for latency (improvements are a work in progress). If you're involved with the design of the hardware, perhaps this is something you can help ensure winds up working well?
>
> For example, AES-NI is straightforward and good, but Intel can do that because they are the CPU. It sounds like your silicon will be adjacent. How do you envision this working in a low latency environment?
Being asynchronous doesn't *necessarily* have to hurt latency; you just need the right queue back-pressure.

We already have multiple queues in the stack. With an async crypto engine we would go from something like:

  stack -> [qdisc] -> wg if -> [wireguard buffer] -> netdev driver ->
  device -> [device buffer] -> wire

to

  stack -> [qdisc] -> wg if -> [wireguard buffer] -> crypto stack ->
  crypto device -> [crypto device buffer] -> wg post-crypto ->
  netdev driver -> device -> [device buffer] -> wire

(where everything in [] is a packet queue).

The wireguard buffer is the source of the latency you're alluding to above (the comment about multi-threaded behaviour), so we probably need to fix that anyway. For the device buffer we have BQL to keep it at a minimum. So that leaves the buffering in the crypto offload device. If we add something like BQL to the crypto offload drivers, we could conceivably avoid having that add a significant amount of latency. In fact, doing so may benefit other users of crypto offloads as well, no? Presumably ipsec has this same issue?

Caveat: I am fairly ignorant about the inner workings of the crypto subsystem, so please excuse any inaccuracies in the above; the diagrams are solely for illustrative purposes... :)

-Toke
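
To make the "something like BQL" idea concrete, here is a rough userspace sketch of byte-based back-pressure in front of an offload queue. It is only an illustration under loud assumptions: the class and method names (ByteLimitedQueue, enqueue, complete) are made up for this example, the limit is fixed rather than adapted at runtime the way the kernel's dynamic queue limits are, and nothing here corresponds to an existing crypto API interface.

```python
from collections import deque


class ByteLimitedQueue:
    """Toy byte-based back-pressure gate in front of a hypothetical offload device."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes  # assumed fixed limit; real BQL tunes this dynamically
        self.inflight_bytes = 0         # bytes handed to the device and not yet completed
        self.backlog = deque()          # packets held back in the layer above the device

    def enqueue(self, packet):
        """Accept a packet from the layer above, then push whatever fits to the device."""
        self.backlog.append(packet)
        return self._dispatch()

    def complete(self, nbytes):
        """The device finished nbytes of work: release budget and push more packets."""
        self.inflight_bytes -= nbytes
        return self._dispatch()

    def _dispatch(self):
        """Hand packets to the device while in-flight bytes are below the limit."""
        submitted = []
        while self.backlog and self.inflight_bytes < self.limit_bytes:
            packet = self.backlog.popleft()
            self.inflight_bytes += len(packet)
            submitted.append(packet)
        return submitted


if __name__ == "__main__":
    q = ByteLimitedQueue(limit_bytes=3000)
    pkt = bytes(1500)
    for _ in range(3):
        q.enqueue(pkt)
    # Two packets are in flight; the third waits upstream until a completion arrives.
    print(q.inflight_bytes, len(q.backlog))
    q.complete(1500)
    print(q.inflight_bytes, len(q.backlog))
```

The point of the sketch is that packets beyond the byte limit wait in the upper queue, where they can still be reordered or dropped by whatever queue management runs there, instead of sitting in the device buffer where nothing can be done about them.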