Re: [PATCH v6 4/4] last-modified: use Bloom filters when available
From: Toon Claes <hidden>
Date: 2025-08-01 16:23:22
Patrick Steinhardt [off-list ref] writes:
On Wed, Jul 30, 2025 at 07:55:10PM +0200, Toon Claes wrote:quoted
Our 'git last-modified' performs a revision walk, and computes a diff at each point in the walk to figure out whether a given revision changed any of the paths it considers interesting. When changed-path Bloom filters are available, we can avoid computing many such diffs. Before computing a diff, we first check if any of the remaining paths of interest were possibly changed at a given commit by consulting its Bloom filter. If any of them are, we are resigned to compute the diff. If none of those queries returned "maybe", we know that the given commit doesn't contain any changed paths which are interesting to us. So, we can avoid computing it in this case. Comparing the perf test results on git.git: Test HEAD~ HEAD ------------------------------------------------------------------------------------ 8020.1: top-level last-modified 4.49(4.34+0.11) 2.22(2.05+0.09) -50.6% 8020.2: top-level recursive last-modified 5.64(5.45+0.11) 5.62(5.30+0.11) -0.4% 8020.3: subdir last-modified 0.11(0.06+0.04) 0.07(0.03+0.04) -36.4%Nice results.quoted
diff --git a/builtin/last-modified.c b/builtin/last-modified.c index e4c73464c7..19bf25f8a5 100644 --- a/builtin/last-modified.c +++ b/builtin/last-modified.c@@ -179,6 +192,27 @@ static void last_modified_diff(struct diff_queue_struct *q, } } +static int maybe_changed_path(struct last_modified *lm, struct commit *origin) +{ + struct bloom_filter *filter; + struct last_modified_entry *ent; + struct hashmap_iter iter; + + if (!lm->rev.bloom_filter_settings) + return 1; + + filter = get_bloom_filter(lm->rev.repo, origin); + if (!filter) + return 1; + + hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) { + if (bloom_filter_contains(filter, &ent->key, + lm->rev.bloom_filter_settings)) + return 1; + } + return 0; +}This function is basically the same as `maybe_changed_paths()` in "blame.c", but that isn't a huge issue from my point of view. What makes me wonder though is why we have an additional check over there for whether or not the commit has a valid generation number.
I've been asking me the same question. And I couldn't find a good reason (neither from the commit history, or from my reasoning). This check was in the version shared by Taylor, but because we were ignoring the return value from generation_numbers_enabled() in that version, it didn't make sense to me to do this check. That's why I removed it.
quoted
@@ -227,6 +264,9 @@ static int last_modified_init(struct last_modified *lm, struct repository *r, return argc; } + prepare_commit_graph(lm->rev.repo); + lm->rev.bloom_filter_settings = get_bloom_filter_settings(lm->rev.repo); +So this here is why we export `prepare_commit_graph()`. How about we instead expose `bloom_filters_enabled()` that mirrors what we do in `generation_numbers_enabled()` and `corrected_commit_dates_enabled()`? That would both be on a higher level and do exactly what we want to achieve.
I've got another proposal, what if we let get_bloom_filter_settings() call prepare_commit_graph()? Functions like repo_find_commit_pos_in_graph() and lookup_commit_in_graph() do this too. -- Cheers, Toon