Re: [PATCH 1/3] diff histogram: intern strings

From: Johannes Schindelin <hidden>
Date: 2021-11-18 15:36:28

Hi,

On Wed, 17 Nov 2021, Phillip Wood wrote:

On 17/11/2021 15:55, Derrick Stolee wrote:

quoted

On 11/17/2021 6:20 AM, Phillip Wood via GitGitGadget wrote:

quoted

From: Phillip Wood <redacted>

Histogram is the only diff algorithm not to call
xdl_classify_record(). xdl_classify_record() ensures that the hash
values of two strings that are not equal differ which means that it is
not necessary to use xdl_recmatch() when comparing lines, all that is
necessary is to compare the hash values. This gives a 7% reduction in
the runtime of "git log --patch" when using the histogram diff
algorithm.

Test                                  HEAD^             HEAD
-----------------------------------------------------------------------------
4000.1: log -3000 (baseline)          0.18(0.14+0.04)   0.19(0.17+0.02)
+5.6%
4000.2: log --raw -3000 (tree-only)   0.99(0.77+0.21)   0.98(0.78+0.20)
-1.0%
4000.3: log -p -3000 (Myers)          4.84(4.31+0.51)   4.81(4.15+0.64)
-0.6%
4000.4: log -p -3000 --histogram      6.34(5.86+0.46)   5.87(5.19+0.66)
-7.4%
4000.5: log -p -3000 --patience       5.39(4.60+0.76)   5.35(4.60+0.73)
-0.7%

Signed-off-by: Phillip Wood <redacted>
---
  xdiff/xhistogram.c |  5 ++---
  xdiff/xprepare.c   | 24 ++++++++----------------
  2 files changed, 10 insertions(+), 19 deletions(-)

diff --git a/xdiff/xhistogram.c b/xdiff/xhistogram.c
index e694bfd9e31..6c1c88a69a1 100644
--- a/xdiff/xhistogram.c
+++ b/xdiff/xhistogram.c

@@ -91,9 +91,8 @@ struct region {
  static int cmp_recs(xpparam_t const *xpp,
  	xrecord_t *r1, xrecord_t *r2)
  {
-	return r1->ha == r2->ha &&
-		xdl_recmatch(r1->ptr, r1->size, r2->ptr, r2->size,
-			    xpp->flags);
+	return r1->ha == r2->ha;
+

nit: stray newline.

The only meaningful change here is that you are relying entirely on
the hash and not checking the content again. This means that hash
collisions on this 32-bit hash could start introducing different
results. Are we worried about that?

xdiff-interface.c limits the size of the file that can be passed to just below
1GB so we are safe. The other diff algorithms are already using this
optimization. (the hash is 64 bits on most platforms, the xdiff code could
really benefit from a unsigned long -> size_t cleanup)

I think the really important thing to point out is that
`xdl_classify_record()` ensures that the `ha` attribute is different for
different text. AFAIR it even "linearizes" the `ha` values, i.e. they
won't be all over the place but start at 0 (or 1).

So no, I'm not worried about collisions. That would be a bug in
`xdl_classify_record()` and I think we would have caught this bug by now.

Ciao,
Dscho

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help