Re: [PATCH 2/2] xdiff: optimize xdl_hash_record_verbatim
From: Alexander Monakov <hidden>
Date: 2025-08-11 14:14:50
On Mon, 11 Aug 2025, Phillip Wood wrote:
quoted
That's what the 'cycles' column in the table gives (6.21/5.8 = 1.070...)It would be helpful to add a column with those calculations in it rather than forcing the reader to calculate the speed up for themselves.
Ok, will change it to version | speedup over (A) | cycles, bn | instructions, bn ---------------------------------------------------------- A 6.38 11.3 B 1.027 6.21 10.89 C 1.1 5.80 9.95 D 1.094 5.83 8.74 ----------------------------------------------------------
Also what is the cycles column measuring? What is it that takes 6.21 cycles for B and only 5.8 cycles for C?
Billions of cycles, e.g. in C the entire command completes in 5.8e9 CPU cycles.
quoted
Then you get 9% from the inlining patch and only 2% from the faster hash function? That's a bit surprising, which compiler and CPU you used? Is it with default optimization (-O2)?I used gcc with -O2 -march=native on an i5-8500. I saw a similar improvement from the inlining when I was playing with xxhash.
Thanks, I'll see if I can benchmark it on a Skylake in the coming days. That said, I think most users will get Git from their distro, without -march=native, right? So I'd suggest looking at plain -O2, especially for xxhash, which selects hashing primitives based on CPU-indicating predefined macros.
quoted
I'd say under reasonable assumptions (e.g. a not too ancient CPU with 3-cycle integer multiplication) the new scheme is generally faster even without asm.Thanks, fwiw I don't see a measurable difference in the timings with and without the asm on my machine -
To be clear, by "without asm" you mean forcing the !__GNUC__ branch where REASSOC_FENCE macro is empty?
sometimes one is faster, sometimes the other, any difference is within the noise.
Would you mind showing your 'gcc --version'? Also, I prefer 'perf stat' for such measurements, because its measurements are not so sensitive to frequency scaling (plus, you can compare my cycles/instructions counts with yours if you run 'perf stat', but I cannot compare your seconds from hyperfine with mine because of course my CPU runs at a different frequency than yours). 'perf stat -r 5' runs the workload 5 times and prints averages and deviation.
quoted
No, what we need to do here is outside of the abstract machine's view, standard functions are not going to help.That's a shame. I'd hoped that stopping the compiler reorder the code would do the same thing - what is the asm doing that's different?
atomic_signal_fence only blocks reordering of references to memory that can be observed from a signal handler interrupting the current thread. It has no effect on variables whose addresses do not escape (let alone never taken in the first place). Here we want to force a particular evaluation order for variables that end up on registers and are not supposed to appear in memory at all. Alexander