Re: [PATCH 1/3] diff histogram: intern strings
From: Johannes Schindelin <hidden>
Date: 2021-11-18 15:36:28
Hi, On Wed, 17 Nov 2021, Phillip Wood wrote:
On 17/11/2021 15:55, Derrick Stolee wrote:quoted
On 11/17/2021 6:20 AM, Phillip Wood via GitGitGadget wrote:quoted
From: Phillip Wood <redacted> Histogram is the only diff algorithm not to call xdl_classify_record(). xdl_classify_record() ensures that the hash values of two strings that are not equal differ which means that it is not necessary to use xdl_recmatch() when comparing lines, all that is necessary is to compare the hash values. This gives a 7% reduction in the runtime of "git log --patch" when using the histogram diff algorithm. Test HEAD^ HEAD ----------------------------------------------------------------------------- 4000.1: log -3000 (baseline) 0.18(0.14+0.04) 0.19(0.17+0.02) +5.6% 4000.2: log --raw -3000 (tree-only) 0.99(0.77+0.21) 0.98(0.78+0.20) -1.0% 4000.3: log -p -3000 (Myers) 4.84(4.31+0.51) 4.81(4.15+0.64) -0.6% 4000.4: log -p -3000 --histogram 6.34(5.86+0.46) 5.87(5.19+0.66) -7.4% 4000.5: log -p -3000 --patience 5.39(4.60+0.76) 5.35(4.60+0.73) -0.7% Signed-off-by: Phillip Wood <redacted> --- xdiff/xhistogram.c | 5 ++--- xdiff/xprepare.c | 24 ++++++++---------------- 2 files changed, 10 insertions(+), 19 deletions(-)diff --git a/xdiff/xhistogram.c b/xdiff/xhistogram.c index e694bfd9e31..6c1c88a69a1 100644 --- a/xdiff/xhistogram.c +++ b/xdiff/xhistogram.c@@ -91,9 +91,8 @@ struct region { static int cmp_recs(xpparam_t const *xpp, xrecord_t *r1, xrecord_t *r2) { - return r1->ha == r2->ha && - xdl_recmatch(r1->ptr, r1->size, r2->ptr, r2->size, - xpp->flags); + return r1->ha == r2->ha; +nit: stray newline. The only meaningful change here is that you are relying entirely on the hash and not checking the content again. This means that hash collisions on this 32-bit hash could start introducing different results. Are we worried about that?xdiff-interface.c limits the size of the file that can be passed to just below 1GB so we are safe. The other diff algorithms are already using this optimization. (the hash is 64 bits on most platforms, the xdiff code could really benefit from a unsigned long -> size_t cleanup)
I think the really important thing to point out is that `xdl_classify_record()` ensures that the `ha` attribute is different for different text. AFAIR it even "linearizes" the `ha` values, i.e. they won't be all over the place but start at 0 (or 1). So no, I'm not worried about collisions. That would be a bug in `xdl_classify_record()` and I think we would have caught this bug by now. Ciao, Dscho