RE: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption
From: Pascal Van Leeuwen <hidden>
Date: 2019-09-27 10:12:01
Also in:
linux-crypto
-----Original Message-----
From: Linus Torvalds <torvalds@linux-foundation.org>
Sent: Friday, September 27, 2019 4:06 AM
To: Pascal Van Leeuwen <redacted>
Cc: Ard Biesheuvel <redacted>; Linux Crypto Mailing List <linux-crypto@vger.kernel.org>; Linux ARM [off-list ref]; Herbert Xu [off-list ref]; David Miller [off-list ref]; Greg KH [off-list ref]; Jason A . Donenfeld [off-list ref]; Samuel Neves [off-list ref]; Dan Carpenter [off-list ref]; Arnd Bergmann [off-list ref]; Eric Biggers [off-list ref]; Andy Lutomirski [off-list ref]; Will Deacon [off-list ref]; Marc Zyngier [off-list ref]; Catalin Marinas [off-list ref]
Subject: Re: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption

On Thu, Sep 26, 2019 at 5:15 PM Pascal Van Leeuwen [off-list ref] wrote:
But even the CPU only thing may have several implementations, of which you want to select the fastest one supported by the _detected_ CPU features (i.e. SSE, AES-NI, AVX, AVX512, NEON, etc. etc.) Do you think this would still be efficient if that would be some large if-else tree? Also, such a fixed implementation wouldn't scale.

Just a note on this part. Yes, with retpoline a large if-else tree is actually *way* better for performance these days than even just one single indirect call. I think the cross-over point is somewhere around 20 if-statements.
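To make the two dispatch styles being compared concrete, here is a minimal userspace sketch in C. Everything in it is an assumption for illustration: the `cpu_level` enum, the `csum_*` functions and the table are hypothetical names, and the "SSE2" and "AVX2" variants simply reuse the generic loop instead of real SIMD code. It only shows the shape of the code: one indirect call through a table versus a chain of compares ending in direct calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature levels; these are not kernel identifiers. */
enum cpu_level { LEVEL_GENERIC, LEVEL_SSE2, LEVEL_AVX2 };

typedef uint32_t (*csum_fn)(const uint8_t *buf, size_t len);

/* Plain byte-sum, used as the reference implementation. */
static uint32_t csum_generic(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum += buf[i];
	return sum;
}

/* Placeholders: real variants would use SIMD; here they reuse the loop. */
static uint32_t csum_sse2(const uint8_t *buf, size_t len)
{
	return csum_generic(buf, len);
}

static uint32_t csum_avx2(const uint8_t *buf, size_t len)
{
	return csum_generic(buf, len);
}

/* Style 1: a single indirect call through a function-pointer table. */
static const csum_fn csum_table[] = {
	[LEVEL_GENERIC] = csum_generic,
	[LEVEL_SSE2]    = csum_sse2,
	[LEVEL_AVX2]    = csum_avx2,
};

uint32_t csum_indirect(enum cpu_level level, const uint8_t *buf, size_t len)
{
	return csum_table[level](buf, len);
}

/* Style 2: an if-else chain where every call target is direct. */
uint32_t csum_branchy(enum cpu_level level, const uint8_t *buf, size_t len)
{
	if (level == LEVEL_AVX2)
		return csum_avx2(buf, len);
	if (level == LEVEL_SSE2)
		return csum_sse2(buf, len);
	return csum_generic(buf, len);
}
```

Both functions return the same result for the same input. The difference under discussion is only in how the call is reached: in a retpoline-enabled build the indirect call in style 1 is the expensive part, while style 2 pays for a few predictable conditional branches instead.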
Yikes, that is just _horrible_ :-(

_However_ there are many CPU architectures out there that _don't_ need the retpoline mitigation and would be unfairly penalized by the deep if-else tree (as opposed to the indirect branch) for a problem they did not cause in the first place. Wouldn't it be more fair to impose the penalty on the CPUs actually _causing_ this problem? Also because those are generally the more powerful CPUs anyway, which would suffer the least from the overhead?
But those kinds of things also are things that we already handle well with instruction rewriting, so they can actually have even less of an overhead than a conditional branch. Using code like

    if (static_cpu_has(X86_FEATURE_AVX2))

actually ends up patching the code at run-time, so you end up having just an unconditional branch. Exactly because CPU feature choices often end up being in critical code-paths where you have one-or-the-other kind of setup.

And yes, one of the big users of this is very much the crypto library code.
Ok, I didn't know about that. So I suppose we could have something like

    if (static_soc_has(HW_CRYPTO_ACCELERATOR_XYZ))
        ...

Hmmm ...
The code to do the above is disgusting, and when you look at the
generated code you see odd unreachable jumps and what looks like a
slow "bts" instruction that does the testing dynamically.
And then the kernel instruction stream gets rewritten fairly early
during the boot depending on the actual CPU capabilities, and the
dynamic tests get overwritten by a direct jump.
Admittedly I don't think the arm64 people go to quite those lengths,
but it certainly wouldn't be impossible there either. It just takes a
bit of architecture knowledge and a strong stomach ;)
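
Portable userspace C cannot rewrite its own instruction stream, so the following is only a rough analog of the idea described above, not the kernel mechanism. It assumes a hypothetical `FEAT_AVX2` bit and a hypothetical `features_init()` that stands in for CPU detection. The feature test is resolved once at startup and the hot path branches on the cached result. The kernel's approach goes one step further and replaces even that cached test with a patched direct jump.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical feature bit; not the kernel's X86_FEATURE_AVX2 value. */
#define FEAT_AVX2 (1u << 0)

/* Decided once at startup, then only read on the hot path. */
static bool use_wide_path;

/* Stand-in for CPU detection; real code would query CPUID or the OS. */
void features_init(uint32_t probed_features)
{
	use_wide_path = (probed_features & FEAT_AVX2) != 0;
}

/* Baseline: XOR one byte at a time. */
static void xor_bytewise(uint8_t *dst, const uint8_t *src, size_t len)
{
	for (size_t i = 0; i < len; i++)
		dst[i] ^= src[i];
}

/* "Fast" variant: XOR eight bytes at a time, then handle the tail. */
static void xor_wide(uint8_t *dst, const uint8_t *src, size_t len)
{
	size_t i = 0;

	for (; i + 8 <= len; i += 8) {
		uint64_t a, b;

		memcpy(&a, dst + i, sizeof(a));
		memcpy(&b, src + i, sizeof(b));
		a ^= b;
		memcpy(dst + i, &a, sizeof(a));
	}
	for (; i < len; i++)
		dst[i] ^= src[i];
}

/* Hot path: returns 1 if the wide variant ran, 0 for the bytewise one. */
int xor_buffers(uint8_t *dst, const uint8_t *src, size_t len)
{
	if (use_wide_path) {
		xor_wide(dst, src, len);
		return 1;
	}
	xor_bytewise(dst, src, len);
	return 0;
}
```

The branch in `xor_buffers()` is the part that boot-time patching would eliminate: once the feature set is known, the test can never change, so it is pure overhead on every call.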
Linus

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com