Re: [PATCH 2/2] xdiff: optimize xdl_hash_record_verbatim

From: Alexander Monakov <hidden>
Date: 2025-08-11 14:14:50

On Mon, 11 Aug 2025, Phillip Wood wrote:

quoted

That's what the 'cycles' column in the table gives (6.21/5.8 = 1.070...)

It would be helpful to add a column with those calculations in it rather than
forcing the reader to calculate the speed up for themselves.

Ok, will change it to

version | speedup over (A) | cycles, bn | instructions, bn
----------------------------------------------------------
A                            6.38         11.3
B         1.027              6.21         10.89
C         1.1                5.80          9.95
D         1.094              5.83          8.74
----------------------------------------------------------

Also what is the cycles column measuring? What is it that takes 6.21 cycles
for B and only 5.8 cycles for C?

Billions of cycles, e.g. in C the entire command completes in 5.8e9 CPU cycles.

quoted

Then you get 9% from the inlining patch and only 2% from the faster hash
function? That's a bit surprising, which compiler and CPU you used? Is it
with default optimization (-O2)?

I used gcc with -O2 -march=native on an i5-8500. I saw a similar improvement
from the inlining when I was playing with xxhash.

Thanks, I'll see if I can benchmark it on a Skylake in the coming days. That
said, I think most users will get Git from their distro, without -march=native,
right? So I'd suggest looking at plain -O2, especially for xxhash, which
selects hashing primitives based on CPU-indicating predefined macros.

quoted

I'd say under reasonable assumptions (e.g. a not too ancient CPU with
3-cycle integer multiplication) the new scheme is generally faster even
without asm.

Thanks, fwiw I don't see a measurable difference in the timings with and
without the asm on my machine -

To be clear, by "without asm" you mean forcing the !__GNUC__ branch where
REASSOC_FENCE macro is empty?

sometimes one is faster, sometimes the other, any difference is within the
noise.

Would you mind showing your 'gcc --version'? Also, I prefer 'perf stat' for
such measurements, because its measurements are not so sensitive to frequency
scaling (plus, you can compare my cycles/instructions counts with yours if you
run 'perf stat', but I cannot compare your seconds from hyperfine with mine
because of course my CPU runs at a different frequency than yours).

'perf stat -r 5' runs the workload 5 times and prints averages and deviation.

quoted

No, what we need to do here is outside of the abstract machine's view,
standard functions are not going to help.

That's a shame. I'd hoped that stopping the compiler reorder the code would do
the same thing - what is the asm doing that's different?

atomic_signal_fence only blocks reordering of references to memory that can be
observed from a signal handler interrupting the current thread. It has no effect
on variables whose addresses do not escape (let alone never taken in the first
place). Here we want to force a particular evaluation order for variables that
end up on registers and are not supposed to appear in memory at all.

Alexander

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help