Re: [PATCH] crypto: nx: fix nx_crypto_ctx_exit argument
From: Simon Richter <hidden>
Date: 2026-05-23 06:30:18
Also in:
linux-crypto
Hi,

On 5/23/26 03:44, Eric Biggers wrote:
> Otherwise this looks good. Really there's a good chance this driver is
> no longer useful (if it ever was) and should just be deleted, but that
> would be a separate effort.

I happen to have one (well, two) of these, so this is relevant to my
interests.
tl;dr: the crypto drivers are most likely unused, the hardware is great,
but the crypto subsystem cannot use it efficiently.
Below drivers/crypto/nx, there are three drivers in a trenchcoat:
- an NX crypto driver that is not endian safe, can therefore only be
  used on big endian systems, and that implements a bunch of AES modes
  plus SHA256/SHA512, all of them synchronous.
- an scomp driver with an IBM-specific compression algorithm ("842")
- a gzip driver that does not integrate with the crypto subsystem and
  provides its own userspace interface.
The "big endian only" thing is a massive restriction: this is how IBM
separates enterprise and hobbyist customers, so if there are users of
this module, then they both have enterprise support contracts.
The gzip mode is really useful: with 4 GB of random data I get
$ time ./nx_gzip test.bin
real 0m2.989s
user 0m1.317s
sys 0m1.665s
$ time gzip -9k test.bin
real 2m57.468s
user 2m55.325s
sys 0m1.682s
so 3 GB/s vs 22 MB/s. Even if I had a workload where I could use all the
CPU cores in parallel, offloading would still be faster, 120 W cheaper,
and would leave the CPU free as a bonus, so I think that's a no-brainer.
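For anyone who wants to redo the arithmetic from the `time` output, here
is a minimal sketch. It assumes "4 GB" means 4e9 bytes, and
throughput_mb_per_sec is a made-up helper name, not part of any tool
mentioned here.

```c
/* Throughput in MB/s (1 MB = 1e6 bytes) for a given byte count and
 * elapsed time. Assumption: the 4 GB test file is 4e9 bytes. */
double throughput_mb_per_sec(double bytes, double seconds)
{
    return bytes / seconds / 1e6;
}
```

Dividing by user time (1.317 s and 175.325 s) gives roughly 3 GB/s and
about 22 to 23 MB/s. Dividing by wall-clock time instead (2.989 s) puts
nx_gzip nearer 1.3 GB/s, while gzip stays at about 22 to 23 MB/s.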
The "842" compression is mainly designed to be fast. The marketing
material claims > 25 GB/s, which makes sense: this unit sits on a
128-bit-wide bus clocked at 2 GHz, and the algorithm is designed around
that. On the other hand it is fairly niche.
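The bus figure can be sanity-checked with a one-line calculation. The
sketch below assumes one full-width transfer per clock cycle, which is
an assumption rather than a documented property of the unit, and
peak_bytes_per_sec is a hypothetical helper.

```c
#include <stdint.h>

/* Peak bytes per second for a bus of the given width and clock.
 * Assumption: exactly one full-width transfer per clock cycle. */
uint64_t peak_bytes_per_sec(uint32_t bus_width_bits, uint64_t clock_hz)
{
    return (uint64_t)(bus_width_bits / 8) * clock_hz;
}
```

With 128 bits and 2 GHz this comes to 32e9 bytes per second, so a claim
of > 25 GB/s sits below that theoretical ceiling.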
I couldn't find numbers for the AES and SHA units. I'd expect them to be
in the same ballpark, but I cannot measure them easily. The CPU manages
~500 MB/s for SHA1 and SHA512 and ~300 MB/s for SHA256, which should be
easy to beat (even a primitive 2-way SHA256 would be at 4 GB/s, and I
doubt IBM left it at that).
POWER11 introduces new opcodes, which will shake things up, but these
machines are on a fairly long replacement cycle.
The main problem with getting the advertised performance is feeding
requests fast enough. Large requests are easy, but the optimum strategy
for feeding small requests is just to start submitting, poll old
requests for completion in between, and start requesting interrupts only
if nothing is complete and it looks like the unit will be busy for a while.
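As a rough illustration of that feeding strategy, here is a toy decision
function. Everything in it is hypothetical: the toy_* names, the state
fields, and the 64 KiB threshold used to guess whether the unit "will be
busy for a while" are made up for this sketch and are not taken from the
driver or the hardware documentation.

```c
#include <stddef.h>

/* Hypothetical cutoff: below this much outstanding work, keep polling. */
#define TOY_IRQ_THRESHOLD_BYTES (64u * 1024u)

enum toy_step {
    TOY_SUBMIT,     /* hand the next request to the unit */
    TOY_REAP,       /* consume a completion found by polling */
    TOY_POLL_AGAIN, /* nothing done yet, but not worth sleeping */
    TOY_WAIT_IRQ,   /* nothing done and unit busy: request an interrupt */
    TOY_DONE        /* nothing pending and nothing in flight */
};

struct toy_feed_state {
    size_t pending;         /* requests not yet submitted */
    size_t in_flight;       /* submitted, not yet completed */
    size_t completed;       /* completed, not yet reaped */
    size_t queue_depth;     /* queue slots, assumed to be > 0 */
    size_t in_flight_bytes; /* rough measure of outstanding work */
};

enum toy_step toy_next_step(const struct toy_feed_state *s)
{
    /* Keep the unit fed first: submit whenever a slot is free. */
    if (s->pending > 0 && s->in_flight + s->completed < s->queue_depth)
        return TOY_SUBMIT;

    /* In between submissions, pick up whatever has already finished. */
    if (s->completed > 0)
        return TOY_REAP;

    if (s->in_flight == 0)
        return TOY_DONE;

    /* Nothing is complete. Only pay for an interrupt if the unit
     * looks like it will be busy for a while. */
    if (s->in_flight_bytes >= TOY_IRQ_THRESHOLD_BYTES)
        return TOY_WAIT_IRQ;

    return TOY_POLL_AGAIN;
}
```

A caller would loop on toy_next_step() and act on the result. Submitting
before reaping is just one possible ordering; the point is that the
interrupt path is only taken when polling has nothing to offer.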
That's not what is currently implemented, and I doubt it could be
implemented with the current kernel interfaces, so getting decent
performance inside the kernel would require some redesign.
I suppose that also explains the synchronous implementation: we are
submitting the request and polling for completion, so overhead is fairly
minimal and should break even at a few hundred bytes, but obviously that
is not the ideal way to run this thing.
The endianness issues are trivial to fix (really just needs a sprinkle
of cpu_to_beXX/beXX_to_cpu when putting the job control blocks together,
like nx-842 does); if you have a definition of what you would consider a
"real world" workload for AES I could run that to gather some numbers.
So far, however, no one has bothered to fix this, and I'm pretty meh
about it myself since I don't have SHA/AES workloads in the kernel, only
in userspace.
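For reference, a minimal userspace sketch of the kind of change meant by
"a sprinkle of cpu_to_beXX/beXX_to_cpu". The kernel helpers are not
available outside the kernel, so the toy_* functions below are stand-ins
written with plain shifts, and struct toy_job_block is a made-up layout,
not the real NX job control block.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for cpu_to_be32(): the returned value has big-endian byte
 * order in memory regardless of host endianness. */
uint32_t toy_cpu_to_be32(uint32_t host)
{
    uint8_t bytes[4] = {
        (uint8_t)(host >> 24), (uint8_t)(host >> 16),
        (uint8_t)(host >> 8),  (uint8_t)host,
    };
    uint32_t be;

    memcpy(&be, bytes, sizeof(be));
    return be;
}

/* Stand-in for be32_to_cpu(): the inverse of the function above. */
uint32_t toy_be32_to_cpu(uint32_t be)
{
    uint8_t bytes[4];

    memcpy(bytes, &be, sizeof(bytes));
    return ((uint32_t)bytes[0] << 24) | ((uint32_t)bytes[1] << 16) |
           ((uint32_t)bytes[2] << 8)  |  (uint32_t)bytes[3];
}

/* Hypothetical control block; fields are stored big-endian because
 * that is what the hardware is assumed to read. */
struct toy_job_block {
    uint32_t opcode_be;
    uint32_t length_be;
};

/* Convert at the point where the block is put together, so the rest
 * of the code can keep working with host-order values. */
void toy_fill_job(struct toy_job_block *job, uint32_t opcode,
                  uint32_t length)
{
    job->opcode_be = toy_cpu_to_be32(opcode);
    job->length_be = toy_cpu_to_be32(length);
}
```

On a big endian host the conversions are no-ops, which is why code that
skips them works there and breaks on little endian.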
Other than that, if you decide to remove the driver from the crypto
subsystem, then nx-gzip should be kept (and probably moved somewhere
else), because it is not a crypto driver, it just shares a bunch of
headers with them.
Simon