Re: [PATCH] crypto: nx: fix nx_crypto_ctx_exit argument

From: Simon Richter <hidden>
Date: 2026-05-23 06:30:18
Also in: linux-crypto

Hi,

On 5/23/26 03:44, Eric Biggers wrote:
> Otherwise this looks good.  Really there's a good chance this driver is
> no longer useful (if it ever was) and should just be deleted, but that
> would be a separate effort.

I happen to have one (well, two) of these, so this is relevant to my 
interests.

tl;dr: the crypto drivers are most likely unused, the hardware is great, 
but the crypto subsystem cannot use it efficiently.

Below drivers/crypto/nx, there are three drivers in a trenchcoat:

  - an NX crypto driver that is not endian-safe, can therefore only be
    used on big-endian systems, and implements a bunch of AES modes
    plus SHA256/SHA512, all of them synchronous.
  - an scomp driver with an IBM-specific compression algorithm
  - a gzip driver that does not integrate with the crypto subsystem and
    provides its own userspace interface.

The "big endian only" thing is a massive restriction: this is how IBM
separates enterprise and hobbyist customers, so if there are users of
this module, then they both have enterprise support contracts.

The gzip mode is really useful; with 4 GB of random data I get

$ time ./nx_gzip test.bin
real 0m2.989s
user 0m1.317s
sys  0m1.665s

$ time gzip -9k test.bin
real 2m57.468s
user 2m55.325s
sys  0m1.682s

so about 1.3 GB/s vs 22 MB/s (4 GB divided by the wall-clock times
above). Even if I had a workload where I could use all the CPU cores in
parallel, offloading is still faster, 120 W cheaper and leaves the CPU
free as a bonus, so I think that's a no-brainer.

The "842" compression is mainly designed to be fast. The marketing
material claims > 25 GB/s, which makes sense: this unit sits on a
128-bit-wide bus clocked at 2 GHz (16 bytes x 2 GHz = 32 GB/s peak), and
the algorithm is designed around that. On the other hand, it is fairly
niche.

I couldn't find numbers for the AES and SHA units. I'd expect them to be
in the same ballpark, but I cannot measure them easily. CPU is ~500 MB/s
for SHA1 and SHA512 and ~300 MB/s for SHA256; that should be easy to
beat (even a primitive 2-way SHA256 would be at 4 GB/s, and I doubt IBM
left it at that).

POWER11 introduces new opcodes, which will shake things up, but these 
machines are on a fairly long replacement cycle.

The main problem with getting the advertised performance is feeding
requests fast enough. Large requests are easy, but the optimum strategy
for feeding small requests is to just start submitting, poll old
requests for completion in between, and start requesting interrupts only
if nothing is complete and it looks like the unit will be busy for a while.
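
As an illustration of that submit/poll/interrupt ordering, here is a
minimal userspace sketch in C. It is not the NX driver's code and does
not use any kernel interface: struct sim_unit, sim_poll() and
sim_wait_irq() are hypothetical stand-ins that simulate a queue, and
"the unit will be busy for a while" is simplified to "nothing completed
and nothing more can be submitted".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical simulated accelerator queue; NOT the real NX interface. */
struct sim_unit {
    size_t in_flight;        /* submitted but not yet reaped */
    size_t depth;            /* maximum number of queued requests */
    size_t done_per_poll;    /* simulated completions visible per poll */
    unsigned long irq_waits; /* how often we fell back to an interrupt */
};

/* Reap whatever has completed; returns the number of requests reaped. */
static size_t sim_poll(struct sim_unit *u)
{
    size_t n = u->done_per_poll < u->in_flight ? u->done_per_poll
                                               : u->in_flight;
    u->in_flight -= n;
    return n;
}

/* Simulated interrupt wait: returns once one request has completed. */
static void sim_wait_irq(struct sim_unit *u)
{
    assert(u->in_flight > 0);
    u->in_flight--;
    u->irq_waits++;
}

/* Push `total` small requests through the unit; returns completions. */
static size_t submit_all(struct sim_unit *u, size_t total)
{
    size_t submitted = 0, completed = 0;

    assert(u->depth > 0);
    while (completed < total) {
        bool progressed = false;

        /* 1. Submit the next request if the queue has room. */
        if (submitted < total && u->in_flight < u->depth) {
            u->in_flight++;
            submitted++;
            progressed = true;
        }

        /* 2. Poll older requests in between submissions. */
        size_t reaped = sim_poll(u);
        completed += reaped;
        if (reaped > 0)
            progressed = true;

        /* 3. Nothing completed and nothing could be submitted:
         *    only now pay for an interrupt. */
        if (!progressed) {
            sim_wait_irq(u);
            completed++;
        }
    }
    return completed;
}
```

With a unit that completes one request per poll, the loop never takes
the interrupt path; with a unit that never shows completions on poll,
it falls back to interrupts only once the queue is full or drained of
new work.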

That's not what is currently implemented, and I doubt it could be 
implemented with the current kernel interfaces, so getting decent 
performance inside the kernel would require some redesign.

I suppose that also explains the synchronous implementation: we are 
submitting the request and polling for completion, so overhead is fairly 
minimal and should break even at a few hundred bytes, but obviously that 
is not the ideal way to run this thing.

The endianness issues are trivial to fix (really just needs a sprinkle 
of cpu_to_beXX/beXX_to_cpu when putting the job control blocks together, 
like nx-842 does); if you have a definition of what you would consider a 
"real world" workload for AES I could run that to gather some numbers.

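To make the kind of conversion meant by cpu_to_beXX/beXX_to_cpu
concrete, here is a small userspace sketch. It uses glibc's
htobe32()/htobe64()/be32toh() from <endian.h> as analogues of the kernel
helpers. struct demo_ccb and its fields are hypothetical and are not the
real NX control block layout.

```c
#define _DEFAULT_SOURCE /* expose htobe32() and friends in glibc */
#include <assert.h>
#include <endian.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical control block; every field is stored big-endian. */
struct demo_ccb {
    uint32_t function_code;
    uint32_t length;
    uint64_t source_addr;
};

/* Convert from CPU byte order when building the block, so the same
 * code produces identical bytes on little- and big-endian hosts. */
static void demo_ccb_fill(struct demo_ccb *ccb, uint32_t fc,
                          uint32_t len, uint64_t addr)
{
    assert(ccb != NULL);
    ccb->function_code = htobe32(fc);
    ccb->length = htobe32(len);
    ccb->source_addr = htobe64(addr);
}

/* Convert back to CPU byte order when reading a field. */
static uint32_t demo_ccb_length(const struct demo_ccb *ccb)
{
    return be32toh(ccb->length);
}
```

On a big-endian host the conversions are no-ops, which is why code
without them can appear to work there and break only on little-endian.
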
So far, however, no one has bothered to fix this, and I'm pretty meh
about it myself since I don't have SHA/AES workloads in the kernel, only
in userspace.

Other than that, if you decide to remove the driver from the crypto 
subsystem, then nx-gzip should be kept (and probably moved somewhere 
else), because it is not a crypto driver, it just shares a bunch of 
headers with them.

    Simon
