Re: Bloom Filters (was Re: We should add a "git gc --auto" after "git clone"... | git

We should add a "git gc --auto" after "git clone" due to commit graph · Ævar Arnfjörð Bjarmason <hidden> · 2018-10-03
Re: We should add a "git gc --auto" after "git clone" due to commit graph · SZEDER Gábor <hidden> · 2018-10-03
Re: We should add a "git gc --auto" after "git clone" due to commit graph · Derrick Stolee <hidden> · 2018-10-03
Re: We should add a "git gc --auto" after "git clone" due to commit graph · Ævar Arnfjörð Bjarmason <hidden> · 2018-10-03
Re: We should add a "git gc --auto" after "git clone" due to commit graph · Ævar Arnfjörð Bjarmason <hidden> · 2018-10-03
Re: We should add a "git gc --auto" after "git clone" due to commit graph · SZEDER Gábor <hidden> · 2018-10-03
Re: We should add a "git gc --auto" after "git clone" due to commit graph · Ævar Arnfjörð Bjarmason <hidden> · 2018-10-03
Re: We should add a "git gc --auto" after "git clone" due to commit graph · SZEDER Gábor <hidden> · 2018-10-03
Re: We should add a "git gc --auto" after "git clone" due to commit graph · Ævar Arnfjörð Bjarmason <hidden> · 2018-10-03
Re: We should add a "git gc --auto" after "git clone" due to commit graph · SZEDER Gábor <hidden> · 2018-10-03
Re: We should add a "git gc --auto" after "git clone" due to commit graph · Stefan Beller <hidden> · 2018-10-03
Re: We should add a "git gc --auto" after "git clone" due to commit graph · Jeff King <hidden> · 2018-10-03
Re: We should add a "git gc --auto" after "git clone" due to commit graph · Ævar Arnfjörð Bjarmason <hidden> · 2018-10-03
Re: We should add a "git gc --auto" after "git clone" due to commit graph · Stefan Beller <hidden> · 2018-10-03
Re: We should add a "git gc --auto" after "git clone" due to commit graph · Ævar Arnfjörð Bjarmason <hidden> · 2018-10-03
Re: We should add a "git gc --auto" after "git clone" due to commit graph · Jeff King <hidden> · 2018-10-03
Re: We should add a "git gc --auto" after "git clone" due to commit graph · Derrick Stolee <hidden> · 2018-10-03
Re: We should add a "git gc --auto" after "git clone" due to commit graph · Jeff King <hidden> · 2018-10-03
Re: We should add a "git gc --auto" after "git clone" due to commit graph · SZEDER Gábor <hidden> · 2018-10-08
Re: We should add a "git gc --auto" after "git clone" due to commit graph · Derrick Stolee <hidden> · 2018-10-08
Re: We should add a "git gc --auto" after "git clone" due to commit graph · SZEDER Gábor <hidden> · 2018-10-08
Re: We should add a "git gc --auto" after "git clone" due to commit graph · Derrick Stolee <hidden> · 2018-10-08
Re: We should add a "git gc --auto" after "git clone" due to commit graph · Jeff King <hidden> · 2018-10-09
Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph) · Derrick Stolee <hidden> · 2018-10-09
Re: Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph) · Ævar Arnfjörð Bjarmason <hidden> · 2018-10-09
Re: Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph) · Jeff King <hidden> · 2018-10-09
Re: Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph) · Derrick Stolee <hidden> · 2018-10-09
Re: Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph) · Jeff King <hidden> · 2018-10-09
Re: Bloom Filters · Jeff King <hidden> · 2018-10-09
[PoC -- do not apply 1/3] initial tree-bitmap proof of concept · Jeff King <hidden> · 2018-10-09
[PoC -- do not apply 2/3] test-tree-bitmap: add "dump" mode · Jeff King <hidden> · 2018-10-09
[PoC -- do not apply 3/3] test-tree-bitmap: replace ewah with custom rle encoding · Jeff King <hidden> · 2018-10-09
Re: Bloom Filters · Derrick Stolee <hidden> · 2018-10-11
Re: Bloom Filters · Jeff King <hidden> · 2018-10-11
Re: We should add a "git gc --auto" after "git clone" due to commit graph · SZEDER Gábor <hidden> · 2018-10-09
[PATCH 0/4] Bloom filter experiment · SZEDER Gábor <hidden> · 2018-10-09
[PATCH 1/4] Add a (very) barebones Bloom filter implementation · SZEDER Gábor <hidden> · 2018-10-09
[PATCH 2/4] commit-graph: write a Bloom filter containing changed paths for each commit · SZEDER Gábor <hidden> · 2018-10-09
Re: [PATCH 2/4] commit-graph: write a Bloom filter containing changed paths for each commit · Jeff King <hidden> · 2018-10-09
Re: [PATCH 2/4] commit-graph: write a Bloom filter containing changed paths for each commit · SZEDER Gábor <hidden> · 2018-10-09
[PATCH 3/4] revision.c: use the Bloom filter to speed up path-limited revision walks · SZEDER Gábor <hidden> · 2018-10-09
[PATCH 4/4] revision.c: add GIT_TRACE_BLOOM_FILTER for a bit of statistics · SZEDER Gábor <hidden> · 2018-10-09
Re: [PATCH 0/4] Bloom filter experiment · Derrick Stolee <hidden> · 2018-10-09
[PATCH 0/2] Per-commit filter proof of concept · Jonathan Tan <hidden> · 2018-10-11
[PATCH 1/2] One filter per commit · Jonathan Tan <hidden> · 2018-10-11
Re: [PATCH 1/2] One filter per commit · Derrick Stolee <hidden> · 2018-10-11
[PATCH] Per-commit and per-parent filters for 2 parents · Jonathan Tan <hidden> · 2018-10-11
[PATCH 2/2] Only make bloom filter for first parent · Jonathan Tan <hidden> · 2018-10-11
Re: [PATCH 0/2] Per-commit filter proof of concept · Ævar Arnfjörð Bjarmason <hidden> · 2018-10-11
Re: [PATCH 0/4] Bloom filter experiment · Derrick Stolee <hidden> · 2018-10-15
Re: [PATCH 0/4] Bloom filter experiment · Jonathan Tan <hidden> · 2018-10-16
Re: We should add a "git gc --auto" after "git clone" due to commit graph · Duy Nguyen <hidden> · 2018-10-03
Re: We should add a "git gc --auto" after "git clone" due to commit graph · Duy Nguyen <hidden> · 2018-10-03
[RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph · Ævar Arnfjörð Bjarmason <hidden> · 2018-10-04
Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph · Derrick Stolee <hidden> · 2018-10-05
Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph · Ævar Arnfjörð Bjarmason <hidden> · 2018-10-05
Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph · Derrick Stolee <hidden> · 2018-10-05
Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph · Ævar Arnfjörð Bjarmason <hidden> · 2018-10-05
Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph · Jeff King <hidden> · 2018-10-05
Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph · Derrick Stolee <hidden> · 2018-10-05
Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph · Jeff King <hidden> · 2018-10-05
Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph · Derrick Stolee <hidden> · 2018-10-05
Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph · Jeff King <hidden> · 2018-10-05
Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph · Ævar Arnfjörð Bjarmason <hidden> · 2018-10-05
Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph · Jeff King <hidden> · 2018-10-05

Re: Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph)

From: Jeff King <hidden>
Date: 2018-10-09 21:14:53

On Tue, Oct 09, 2018 at 03:03:08PM -0400, Derrick Stolee wrote:

quoted

I wonder if Roaring does better here.

In these sparse cases, usually Roaring will organize the data as "array
chunks" which are simply lists of the values. The thing that makes this
still compressible is that we store two bytes per entry, as the entries are
grouped by a common most-significant two bytes. SInce you say ~120k unique
paths, the Roaring bitmap would have two or three chunks per bitmap (and
those chunks could be empty). The overhead to store the chunk positions,
types, and lengths does come at a cost, but it's more like 32 bytes _per
commit_.

Hmph. It really sounds like we could do better with a custom RLE
solution. But that makes me feel like I'm missing something, because
surely I can't invent something better than the state of the art in a
simple thought experiment, right?

I know what I'm proposing would be quite bad for random access, but my
impression is that EWAH is the same. For the scale of bitmaps we're
talking about, I think linear/streaming access through the bitmap would
be OK.

quoted

So at any rate, I do think it would not be out of the question to store
bitmaps like this. I'm much more worried about the maintenance cost of
adding new entries incrementally. I think it's only feasible if we give
up sorting, and then I wonder what other problems that might cause.

The patch below gives me a starting point to try the Bloom filter approach
and see what the numbers are like. You did all the "git" stuff like
computing the changed paths, so thanks!

Great, I hope it can be useful. I almost wrote it as perl consuming the
output of "log --format=%h --name-only", but realized I didn't have a
perl ewah implementation handy.

You'll probably want to tweak this part:

quoted

+	prepare_revision_walk(&revs);
+	while ((commit = get_revision(&revs))) {
+		data.commit = commit;
+		diff_tree_combined_merge(commit, 0, &revs);
+	}

...to handle merges in a particular way. This will actually ignore
merges totally. You could add "-m" to the revision arguments to get a
per-parent diff, but of course you'd see those in your callback
individually. If you want to do _just_ the first parent diff, I think
you'll have to pick it apart manually, like:

  while ((commit = get_revision(&revs))) {
	struct object_id *parent_oid;

	/* ignore non-first parents, but handle root commits like --root */
	if (commit->parents)
		parent = &commit->parents->item->object.oid;
	else
		parent = the_hash_algo->empty_tree;

	diff_tree_oid(parent, &commit->oid, ...);
  }

-Peff

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help