Thread (79 messages) 79 messages, 10 authors, 2025-08-05

Re: [PATCH v6 4/4] last-modified: use Bloom filters when available

From: Toon Claes <hidden>
Date: 2025-08-01 16:23:22

Patrick Steinhardt [off-list ref] writes:
On Wed, Jul 30, 2025 at 07:55:10PM +0200, Toon Claes wrote:
quoted
Our 'git last-modified' performs a revision walk, and computes a diff at
each point in the walk to figure out whether a given revision changed
any of the paths it considers interesting.

When changed-path Bloom filters are available, we can avoid computing
many such diffs. Before computing a diff, we first check if any of the
remaining paths of interest were possibly changed at a given commit by
consulting its Bloom filter. If any of them are, we are resigned to
compute the diff.

If none of those queries returned "maybe", we know that the given commit
doesn't contain any changed paths which are interesting to us. So, we
can avoid computing it in this case.

Comparing the perf test results on git.git:

    Test                                        HEAD~             HEAD
    ------------------------------------------------------------------------------------
    8020.1: top-level last-modified             4.49(4.34+0.11)   2.22(2.05+0.09) -50.6%
    8020.2: top-level recursive last-modified   5.64(5.45+0.11)   5.62(5.30+0.11) -0.4%
    8020.3: subdir last-modified                0.11(0.06+0.04)   0.07(0.03+0.04) -36.4%
Nice results.
quoted
diff --git a/builtin/last-modified.c b/builtin/last-modified.c
index e4c73464c7..19bf25f8a5 100644
--- a/builtin/last-modified.c
+++ b/builtin/last-modified.c
@@ -179,6 +192,27 @@ static void last_modified_diff(struct diff_queue_struct *q,
 	}
 }
 
+static int maybe_changed_path(struct last_modified *lm, struct commit *origin)
+{
+	struct bloom_filter *filter;
+	struct last_modified_entry *ent;
+	struct hashmap_iter iter;
+
+	if (!lm->rev.bloom_filter_settings)
+		return 1;
+
+	filter = get_bloom_filter(lm->rev.repo, origin);
+	if (!filter)
+		return 1;
+
+	hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
+		if (bloom_filter_contains(filter, &ent->key,
+					  lm->rev.bloom_filter_settings))
+			return 1;
+	}
+	return 0;
+}
This function is basically the same as `maybe_changed_paths()` in
"blame.c", but that isn't a huge issue from my point of view. What makes
me wonder though is why we have an additional check over there for
whether or not the commit has a valid generation number.
I've been asking me the same question. And I couldn't find a good reason
(neither from the commit history, or from my reasoning). This check was
in the version shared by Taylor, but because we were ignoring the return
value from generation_numbers_enabled() in that version, it didn't make
sense to me to do this check. That's why I removed it.
quoted
@@ -227,6 +264,9 @@ static int last_modified_init(struct last_modified *lm, struct repository *r,
 		return argc;
 	}
 
+	prepare_commit_graph(lm->rev.repo);
+	lm->rev.bloom_filter_settings = get_bloom_filter_settings(lm->rev.repo);
+
So this here is why we export `prepare_commit_graph()`. How about we
instead expose `bloom_filters_enabled()` that mirrors what we do in
`generation_numbers_enabled()` and `corrected_commit_dates_enabled()`?
That would both be on a higher level and do exactly what we want to
achieve.
I've got another proposal, what if we let get_bloom_filter_settings()
call prepare_commit_graph()? Functions like
repo_find_commit_pos_in_graph() and lookup_commit_in_graph() do this
too. 

-- 
Cheers,
Toon
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help