Re: [PATCH 00/12] crypto: sha256 - Use partial block API
From: Eric Biggers <ebiggers@kernel.org>
Date: 2025-05-01 02:26:21
Also in:
linux-arch, linux-arm-kernel, linux-crypto, linux-mips, linux-riscv, linux-s390, lkml, sparclinux
On Thu, May 01, 2025 at 09:21:15AM +0800, Herbert Xu wrote:
On Wed, Apr 30, 2025 at 10:45:43AM -0700, Eric Biggers wrote:quoted
As for your sha256_finup "optimization", it's an interesting idea, but unfortunately it slightly slows down the common case which is count % 64 < 56, due to the unnecessary copy to the stack and the following zeroization. In the uncommon case where count % 64 >= 56 you do get to pass nblocks=2 to sha256_blocks_*(), but ultimately SHA-256 is serialized block-by-block anyway, so it ends up being only slightly faster in that case, which again is the uncommon case. So while it's an interesting idea, it doesn't seem to actually be better. And the fact that that patch is also being used to submit unrelated, more dubious changes isn't very helpful, of course.I'm more than willing to change sha256_finup if you can prove it with real numbers that it is worse than the single-block version.
Interesting approach -- pushing out misguided optimizations without data, then demanding data for them to be reverted. It's obviously worse for len % 64 < 56 for the reason I gave, so this is a waste of time IMO. But since you're insisting on data anyway, here are some quick benchmarks on AMD Zen 5 (not going to bother formatting into a table): Before your finup "optimization": sha256(len=0): 145 cycles sha256(len=1): 146 cycles sha256(len=2): 146 cycles sha256(len=3): 146 cycles sha256(len=4): 146 cycles sha256(len=5): 146 cycles sha256(len=6): 146 cycles sha256(len=7): 146 cycles sha256(len=8): 151 cycles sha256(len=9): 148 cycles sha256(len=10): 148 cycles sha256(len=11): 148 cycles sha256(len=12): 148 cycles sha256(len=13): 148 cycles sha256(len=14): 148 cycles sha256(len=15): 149 cycles sha256(len=16): 149 cycles sha256(len=17): 148 cycles sha256(len=18): 148 cycles sha256(len=19): 148 cycles sha256(len=20): 148 cycles sha256(len=21): 148 cycles sha256(len=22): 148 cycles sha256(len=23): 148 cycles sha256(len=24): 148 cycles sha256(len=25): 148 cycles sha256(len=26): 148 cycles sha256(len=27): 148 cycles sha256(len=28): 148 cycles sha256(len=29): 148 cycles sha256(len=30): 148 cycles sha256(len=31): 148 cycles sha256(len=32): 151 cycles sha256(len=33): 148 cycles sha256(len=34): 148 cycles sha256(len=35): 148 cycles sha256(len=36): 148 cycles sha256(len=37): 148 cycles sha256(len=38): 148 cycles sha256(len=39): 148 cycles sha256(len=40): 148 cycles sha256(len=41): 148 cycles sha256(len=42): 148 cycles sha256(len=43): 148 cycles sha256(len=44): 148 cycles sha256(len=45): 148 cycles sha256(len=46): 150 cycles sha256(len=47): 149 cycles sha256(len=48): 147 cycles sha256(len=49): 147 cycles sha256(len=50): 147 cycles sha256(len=51): 147 cycles sha256(len=52): 147 cycles sha256(len=53): 147 cycles sha256(len=54): 147 cycles sha256(len=55): 148 cycles sha256(len=56): 278 cycles sha256(len=57): 278 cycles sha256(len=58): 278 cycles sha256(len=59): 278 cycles sha256(len=60): 277 cycles sha256(len=61): 277 cycles sha256(len=62): 277 cycles sha256(len=63): 276 cycles sha256(len=64): 276 cycles After your finup "optimization": sha256(len=0): 188 cycles sha256(len=1): 190 cycles sha256(len=2): 190 cycles sha256(len=3): 190 cycles sha256(len=4): 189 cycles sha256(len=5): 189 cycles sha256(len=6): 189 cycles sha256(len=7): 190 cycles sha256(len=8): 187 cycles sha256(len=9): 188 cycles sha256(len=10): 188 cycles sha256(len=11): 188 cycles sha256(len=12): 189 cycles sha256(len=13): 189 cycles sha256(len=14): 188 cycles sha256(len=15): 189 cycles sha256(len=16): 189 cycles sha256(len=17): 190 cycles sha256(len=18): 190 cycles sha256(len=19): 190 cycles sha256(len=20): 190 cycles sha256(len=21): 190 cycles sha256(len=22): 190 cycles sha256(len=23): 190 cycles sha256(len=24): 191 cycles sha256(len=25): 191 cycles sha256(len=26): 191 cycles sha256(len=27): 191 cycles sha256(len=28): 191 cycles sha256(len=29): 192 cycles sha256(len=30): 191 cycles sha256(len=31): 191 cycles sha256(len=32): 191 cycles sha256(len=33): 191 cycles sha256(len=34): 191 cycles sha256(len=35): 191 cycles sha256(len=36): 192 cycles sha256(len=37): 192 cycles sha256(len=38): 192 cycles sha256(len=39): 191 cycles sha256(len=40): 191 cycles sha256(len=41): 194 cycles sha256(len=42): 193 cycles sha256(len=43): 193 cycles sha256(len=44): 193 cycles sha256(len=45): 193 cycles sha256(len=46): 194 cycles sha256(len=47): 194 cycles sha256(len=48): 193 cycles sha256(len=49): 195 cycles sha256(len=50): 195 cycles sha256(len=51): 196 cycles sha256(len=52): 196 cycles sha256(len=53): 195 cycles sha256(len=54): 195 cycles sha256(len=55): 195 cycles sha256(len=56): 297 cycles sha256(len=57): 297 cycles sha256(len=58): 297 cycles sha256(len=59): 297 cycles sha256(len=60): 297 cycles sha256(len=61): 297 cycles sha256(len=62): 297 cycles sha256(len=63): 297 cycles sha256(len=64): 292 cycles So your "optimization" made it ~43 cycles slower for len % 64 < 56, or ~19 cycles slower for len % 64 >= 56. As I said, it's from the overhead of unnecessarily copying the data onto the stack and then having to zeroize it at the end. - Eric