Thread: 17 messages, 6 authors, 2026-01-19

Re: [PATCH] compiler_types: Introduce inline_for_performance

From: Nicolas Pitre <nico@fluxnic.net>
Date: 2026-01-19 19:44:53
Also in: lkml

On Mon, 19 Jan 2026, David Laight wrote:
> On Mon, 19 Jan 2026 10:47:51 -0500 (EST)
> Nicolas Pitre [off-list ref] wrote:
>> On Sun, 18 Jan 2026, David Laight wrote:
>>> On 32bit you probably don't want to inline __arch_xprod_64(), but you do
>>> want to pass (bias ? m : 0) and may want separate functions for the
>>> 'no overflow' case (if it is common enough to worry about).
>> You do want to inline it. Performance quickly degrades otherwise.
> If it isn't inlined you want a real C function in div.c (or similar),
> not the compiler generating a separate body in the object file of each
> file that uses it.

Yes you absolutely do in this very particular case. This relies on a 
long sequence of code that collapses to only a few assembly instructions 
due to constant propagation. But most of the time gcc is not smart 
enough to realize that (strangely enough it used to be fine more than 10 
years ago). The corresponding function is not only slower but actually 
creates bigger code from the argument passing handling overhead.
>> And __arch_xprod_64() exists only for 32bit btw.
> I wonder how much of a mess gcc makes of that code.
> I added asm functions for u64 mul_add(u32 a, u32 b, u32 c) calculating
> a * b + c without explicit zero extending any of the 32 bit values.
> Without that gcc runs out of registers and starts spilling to stack
> instead of just generating 'mul; add; adc $0'.

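For reference, the arithmetic that the mul_add() helper mentioned above is described as computing can be written in plain C as below. This is only a sketch of the semantics (a * b + c widened to 64 bits); the name mul_add_sketch is hypothetical, and it is not the asm implementation being discussed, whose whole point is to avoid what the compiler generates for code like this.

```c
#include <stdint.h>

/*
 * Hypothetical C rendering of the semantics only: a * b + c as a
 * 64-bit result. The sum cannot overflow 64 bits, because
 * (2^32 - 1)^2 + (2^32 - 1) = 2^64 - 2^32.
 */
static inline uint64_t mul_add_sketch(uint32_t a, uint32_t b, uint32_t c)
{
	/* Widen one operand so the multiply is 32x32=>64. */
	return (uint64_t)a * b + c;
}
```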
Here this is different. Let me copy the definition:

* Prototype: uint64_t __arch_xprod_64(const uint64_t m, uint64_t n, bool bias)
* Semantic:  retval = ((bias ? m : 0) + m * n) >> 64
* 
* The product is a 128-bit value, scaled down to 64 bits.
* Hoping for compile-time optimization of  conditional code.
* Architectures may provide their own optimized assembly implementation.

ARM32 provides its own definition. Last time I checked, RV32 already 
produced optimal code from the default C implementation.
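
To make the semantic above concrete, here is a minimal portable C sketch that builds the 128-bit product from four 32x32=>64 partial products and returns the upper 64 bits. The name xprod_64_sketch is hypothetical, and this is not the kernel's default implementation nor the ARM32 one; it only illustrates the stated semantic, retval = ((bias ? m : 0) + m * n) >> 64.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch: ((bias ? m : 0) + m * n) >> 64 in 32-bit chunks. */
static inline uint64_t xprod_64_sketch(uint64_t m, uint64_t n, bool bias)
{
	uint32_t m_lo = (uint32_t)m, m_hi = (uint32_t)(m >> 32);
	uint32_t n_lo = (uint32_t)n, n_hi = (uint32_t)(n >> 32);

	/* Four partial products, each 32x32=>64. */
	uint64_t ll = (uint64_t)m_lo * n_lo;
	uint64_t lh = (uint64_t)m_lo * n_hi;
	uint64_t hl = (uint64_t)m_hi * n_lo;
	uint64_t hh = (uint64_t)m_hi * n_hi;

	/* Optional bias term, split into the same 32-bit columns. */
	uint64_t b_lo = bias ? m_lo : 0;
	uint64_t b_hi = bias ? m_hi : 0;

	/* Column 0 (bits 0..31): only its carry matters. */
	uint64_t c0 = (ll & 0xffffffffu) + b_lo;

	/* Column 1 (bits 32..63): at most five 32-bit terms, no overflow. */
	uint64_t c1 = (c0 >> 32) + (ll >> 32) +
		      (lh & 0xffffffffu) + (hl & 0xffffffffu) + b_hi;

	/* Bits 64..127: carry from column 1 plus the high parts. */
	return (c1 >> 32) + (lh >> 32) + (hl >> 32) + hh;
}
```

When m and bias are compile-time constants and the function is inlined, a compiler can in principle fold the bias selection and drop partial products that are known to be zero, which is the constant-propagation effect described earlier in this message.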
> But 64bit systems without a 64x64=>128 multiply (ie without u128
> support) also need the 'multiply in 32bit chunks' code.

Again this is only for 32-bit systems. 64-bit systems use none of that.


Nicolas