Thread (28 messages) 28 messages, 5 authors, 2025-06-14

Re: [PATCH v2 00/12] lib/crc: improve how arch-optimized code is integrated

From: Eric Biggers <ebiggers@kernel.org>
Date: 2025-06-10 19:12:32
Also in: linux-arch, linux-arm-kernel, linux-crypto, linux-mips, linux-riscv, linux-s390, lkml, loongarch, sparclinux

On Tue, Jun 10, 2025 at 11:39:22AM -0600, Jason A. Donenfeld wrote:
On Sun, Jun 08, 2025 at 04:48:17PM -0700, Eric Biggers wrote:
quoted
On Sat, Jun 07, 2025 at 05:47:02PM -0600, Jason A. Donenfeld wrote:
quoted
On Sat, Jun 07, 2025 at 01:04:42PM -0700, Eric Biggers wrote:
quoted
Having arch-specific code outside arch/ was somewhat controversial when
Zinc proposed it back in 2018.  But I don't think the concerns are
warranted.  It's better from a technical perspective, as it enables the
improvements mentioned above.  This model is already successfully used
in other places in the kernel such as lib/raid6/.  The community of each
architecture still remains free to work on the code, even if it's not in
arch/.  At the time there was also a desire to put the library code in
the same files as the old-school crypto API, but that was a mistake; now
that the library is separate, that's no longer a constraint either.
I can't express how happy I am to see this revived. It's clearly the
right way forward and makes it a lot simpler for us to dispatch to
various arch implementations and also is organizationally simpler.

Jason
Thanks!  Can I turn that into an Acked-by?
Took me a little while longer to fully review it. Sure,

    Acked-by: Jason A. Donenfeld [off-list ref]

Side note: I wonder about eventually turning some of the static branches
into static calls.
Yes, Linus was wondering the same thing earlier.  It does run into a couple
issues.  First, only x86 and powerpc implement static_call properly; everywhere
else it's just an indirect call.  Second, there's often some code shared above
the level at which we'd like to do the dispatch.  For example, consider crc32_le
on x86.  If we expand the CRC_PCLMUL macro and inline crc32_le_arch and
crc32_le_base as the compiler does, crc32_le ends up as:

    u32 crc32_le(u32 crc, const u8 *p, size_t len)
    {
            if (len >= 16 && static_branch_likely(&have_pclmulqdq) &&
                crypto_simd_usable()) {
                    const void *consts_ptr;

                    consts_ptr = crc32_lsb_0xedb88320_consts.fold_across_128_bits_consts;
                    kernel_fpu_begin();
                    crc = static_call(crc32_lsb_pclmul)(crc, p, len, consts_ptr);
                    kernel_fpu_end();
                    return crc;
            }
            while (len--)
                    crc = (crc >> 8) ^ crc32table_le[(crc & 255) ^ *p++];
            return crc;
    }

The existing static_call selects between 3 different assembly functions, all of
which require a kernel-mode FPU section and only support len >= 16.

We could instead unconditionally do a static_call upon entry to the function,
with 4 possible targets.  But then we'd have to duplicate the kernel FPU
begin/end sequence in 3 different functions.  Also, it would add an extra
function call for the case where 'len < 16', which is a common case and is
exactly the case where per-call overhead matters the most.

However, if we could actually inline the static call into the *callers* of
crc32_le(), that would make it more worthwhile.  I'm not sure that's possible,
though, especially considering that this code is tristate.

Anyway, this is tangential to this patchset.  Though the new way the code is
organized does make it more feasible to have e.g. a centralized static_call in
the future if we choose to go in that direction.

- Eric
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help