RE: [PATCH] crypto: arm64/gcm-ce - unroll factors to 4-way interleave of aes... | linux-arm-kernel

quoted

On Thu, September 30, 2021 10:57 PM, Ard Biesheuvel [off-list ref]
wrote:
On Thu, 30 Sept 2021 at 03:32, Xiaokang Qian [off-list ref]
wrote:
Thanks for the review.

I will firstly change the decrypt path to compare the tag using SIMD code,
and then  pass all of the self tests include fuzz tests(enabled by
CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y), big endian ,little endian
tests.
OK

About the 1K data point, I just remember that the 1420 bytes packet is
commonly used in IPSEC.
Yes, but your code is faster than the existing code for 1420 byte packets, right?
So why should we keep the original code? We don't use GCM for block
storage, and if IPsec throughput is a key performance metric for your system,
you are likely to be using the maximum packet size so 1420 bytes not 1k.

Yes, the code is faster than the existing code for 1420 bytes packets, and the bigger the data size, the more the performance is uplifted.
But there is one issue,  our code will interleave 4 blocks for crypto-AES instructions and another 4 blocks for ghash(pmull) in parallel, so 
it's more friendly to the bigger data size but not friendly to the smaller ones.
For the data size that is smaller than 1k data size, the performance  will have some regression. 
So we keep the two driver exist together.

-----Original Message-----
From: Ard Biesheuvel <ardb@kernel.org>
Sent: Wednesday, September 29, 2021 5:04 AM
To: Eric Biggers <ebiggers@kernel.org>
Cc: Xiaokang Qian <redacted>; Herbert Xu
[off-list ref]; David S. Miller [off-list ref];
Catalin Marinas [off-list ref]; Will Deacon
[off-list ref]; nd [off-list ref]; Linux Crypto Mailing List
[off-list ref]; Linux ARM
[off-list ref]; Linux Kernel Mailing List
[off-list ref]
Subject: Re: [PATCH] crypto: arm64/gcm-ce - unroll factors to 4-way
interleave of aes and ghash

On Tue, 28 Sept 2021 at 08:27, Eric Biggers [off-list ref] wrote:
On Thu, Sep 23, 2021 at 06:30:25AM +0000, XiaokangQian wrote:
To improve performance on cores with deep piplines such as A72,N1,
implement gcm(aes) using a 4-way interleave of aes and ghash
(totally
8 blocks in parallel), which can make full utilize of pipelines
rather than the 4-way interleave we used currently. It can gain
about 20% for big data sizes such that 8k.

This is a complete new version of the GCM part of the combined
GCM/GHASH driver, it will co-exist with the old driver, only serve
for big data sizes. Instead of interleaving four invocations of
AES where each chunk of 64 bytes is encrypted first and then
ghashed, the new version uses a more coarse grained approach where
a chunk of
64 bytes is encrypted and at the same time, one chunk of 64 bytes
is ghashed (or ghashed and decrypted in the converse case).

The table below compares the performance of the old driver and the
new one on various micro-architectures and running in various
modes with various data sizes.

            |     AES-128       |     AES-192       |     AES-256       |
     #bytes | 1024 | 1420 |  8k | 1024 | 1420 |  8k | 1024 | 1420 |  8k |
     -------+------+------+-----+------+------+-----+------+------+-----+
        A72 | 5.5% |  12% | 25% | 2.2% |  9.5%|  23%| -1%  |  6.7%| 19% |
        A57 |-0.5% |  9.3%| 32% | -3%  |  6.3%|  26%| -6%  |  3.3%| 21% |
        N1  | 0.4% |  7.6%|24.5%| -2%  |  5%  |  22%| -4%  |
2.7%| 20% |

Signed-off-by: XiaokangQian <redacted>
Does this pass the self-tests, including the fuzz tests which are
enabled by CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y?
Please test both little-endian and big-endian. (Note that you don't
need a big-endian user space for this - the self tests are executed
before the rootfs is mounted)

Also, you will have to rebase this onto the latest cryptodev tree, which
carries some changes I made recently to this driver.
Finally, I'd like to discuss whether we really need two separate
drivers here. The 1k data point is not as relevant as the other ones,
which show a worthwhile speedup for all micro architectures and data
sizes (although I will give this a spin on TX2 myself when I have the
chance)

*If* we switch to this implementation completely, I would like to keep the
improvement I added recently to the decrypt path to compare the tag using
SIMD code, rather than copying it out and using memcmp().
Could you look into adopting this for this version as well?

--
Ard.
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help