Thread (9 messages) 9 messages, 2 authors, 2025-08-13

Re: [PATCH 2/2] xdiff: optimize xdl_hash_record_verbatim

From: Alexander Monakov <hidden>
Date: 2025-08-11 14:14:50

On Mon, 11 Aug 2025, Phillip Wood wrote:
quoted
That's what the 'cycles' column in the table gives (6.21/5.8 = 1.070...)
It would be helpful to add a column with those calculations in it rather than
forcing the reader to calculate the speed up for themselves.
Ok, will change it to

version | speedup over (A) | cycles, bn | instructions, bn
----------------------------------------------------------
A                            6.38         11.3
B         1.027              6.21         10.89
C         1.1                5.80          9.95
D         1.094              5.83          8.74
----------------------------------------------------------
Also what is the cycles column measuring? What is it that takes 6.21 cycles
for B and only 5.8 cycles for C?
Billions of cycles, e.g. in C the entire command completes in 5.8e9 CPU cycles.
quoted
Then you get 9% from the inlining patch and only 2% from the faster hash
function? That's a bit surprising, which compiler and CPU you used? Is it
with default optimization (-O2)?
I used gcc with -O2 -march=native on an i5-8500. I saw a similar improvement
from the inlining when I was playing with xxhash.
Thanks, I'll see if I can benchmark it on a Skylake in the coming days. That
said, I think most users will get Git from their distro, without -march=native,
right? So I'd suggest looking at plain -O2, especially for xxhash, which
selects hashing primitives based on CPU-indicating predefined macros.
quoted
I'd say under reasonable assumptions (e.g. a not too ancient CPU with
3-cycle integer multiplication) the new scheme is generally faster even
without asm.
Thanks, fwiw I don't see a measurable difference in the timings with and
without the asm on my machine -
To be clear, by "without asm" you mean forcing the !__GNUC__ branch where
REASSOC_FENCE macro is empty?
sometimes one is faster, sometimes the other, any difference is within the
noise.
Would you mind showing your 'gcc --version'? Also, I prefer 'perf stat' for
such measurements, because its measurements are not so sensitive to frequency
scaling (plus, you can compare my cycles/instructions counts with yours if you
run 'perf stat', but I cannot compare your seconds from hyperfine with mine
because of course my CPU runs at a different frequency than yours).

'perf stat -r 5' runs the workload 5 times and prints averages and deviation.
quoted
No, what we need to do here is outside of the abstract machine's view,
standard functions are not going to help.
That's a shame. I'd hoped that stopping the compiler reorder the code would do
the same thing - what is the asm doing that's different?
atomic_signal_fence only blocks reordering of references to memory that can be
observed from a signal handler interrupting the current thread. It has no effect
on variables whose addresses do not escape (let alone never taken in the first
place). Here we want to force a particular evaluation order for variables that
end up on registers and are not supposed to appear in memory at all.

Alexander
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help