Thread (79 messages) 79 messages, 10 authors, 2025-08-05

Re: [PATCH v6 4/4] last-modified: use Bloom filters when available

From: Patrick Steinhardt <hidden>
Date: 2025-07-31 06:43:15

On Wed, Jul 30, 2025 at 07:55:10PM +0200, Toon Claes wrote:
Our 'git last-modified' performs a revision walk, and computes a diff at
each point in the walk to figure out whether a given revision changed
any of the paths it considers interesting.

When changed-path Bloom filters are available, we can avoid computing
many such diffs. Before computing a diff, we first check if any of the
remaining paths of interest were possibly changed at a given commit by
consulting its Bloom filter. If any of them are, we are resigned to
compute the diff.

If none of those queries returned "maybe", we know that the given commit
doesn't contain any changed paths which are interesting to us. So, we
can avoid computing it in this case.

Comparing the perf test results on git.git:

    Test                                        HEAD~             HEAD
    ------------------------------------------------------------------------------------
    8020.1: top-level last-modified             4.49(4.34+0.11)   2.22(2.05+0.09) -50.6%
    8020.2: top-level recursive last-modified   5.64(5.45+0.11)   5.62(5.30+0.11) -0.4%
    8020.3: subdir last-modified                0.11(0.06+0.04)   0.07(0.03+0.04) -36.4%
Nice results.
quoted hunk ↗ jump to hunk
diff --git a/builtin/last-modified.c b/builtin/last-modified.c
index e4c73464c7..19bf25f8a5 100644
--- a/builtin/last-modified.c
+++ b/builtin/last-modified.c
@@ -179,6 +192,27 @@ static void last_modified_diff(struct diff_queue_struct *q,
 	}
 }
 
+static int maybe_changed_path(struct last_modified *lm, struct commit *origin)
+{
+	struct bloom_filter *filter;
+	struct last_modified_entry *ent;
+	struct hashmap_iter iter;
+
+	if (!lm->rev.bloom_filter_settings)
+		return 1;
+
+	filter = get_bloom_filter(lm->rev.repo, origin);
+	if (!filter)
+		return 1;
+
+	hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
+		if (bloom_filter_contains(filter, &ent->key,
+					  lm->rev.bloom_filter_settings))
+			return 1;
+	}
+	return 0;
+}
This function is basically the same as `maybe_changed_paths()` in
"blame.c", but that isn't a huge issue from my point of view. What makes
me wonder though is why we have an additional check over there for
whether or not the commit has a valid generation number.
quoted hunk ↗ jump to hunk
@@ -227,6 +264,9 @@ static int last_modified_init(struct last_modified *lm, struct repository *r,
 		return argc;
 	}
 
+	prepare_commit_graph(lm->rev.repo);
+	lm->rev.bloom_filter_settings = get_bloom_filter_settings(lm->rev.repo);
+
So this here is why we export `prepare_commit_graph()`. How about we
instead expose `bloom_filters_enabled()` that mirrors what we do in
`generation_numbers_enabled()` and `corrected_commit_dates_enabled()`?
That would both be on a higher level and do exactly what we want to
achieve.

Patrick
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help