Thread (65 messages) 65 messages, 7 authors, 2018-10-16

Re: Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph)

From: Jeff King <hidden>
Date: 2018-10-09 21:14:53

On Tue, Oct 09, 2018 at 03:03:08PM -0400, Derrick Stolee wrote:
quoted
I wonder if Roaring does better here.
In these sparse cases, usually Roaring will organize the data as "array
chunks" which are simply lists of the values. The thing that makes this
still compressible is that we store two bytes per entry, as the entries are
grouped by a common most-significant two bytes. SInce you say ~120k unique
paths, the Roaring bitmap would have two or three chunks per bitmap (and
those chunks could be empty). The overhead to store the chunk positions,
types, and lengths does come at a cost, but it's more like 32 bytes _per
commit_.
Hmph. It really sounds like we could do better with a custom RLE
solution. But that makes me feel like I'm missing something, because
surely I can't invent something better than the state of the art in a
simple thought experiment, right?

I know what I'm proposing would be quite bad for random access, but my
impression is that EWAH is the same. For the scale of bitmaps we're
talking about, I think linear/streaming access through the bitmap would
be OK.
quoted
So at any rate, I do think it would not be out of the question to store
bitmaps like this. I'm much more worried about the maintenance cost of
adding new entries incrementally. I think it's only feasible if we give
up sorting, and then I wonder what other problems that might cause.
The patch below gives me a starting point to try the Bloom filter approach
and see what the numbers are like. You did all the "git" stuff like
computing the changed paths, so thanks!
Great, I hope it can be useful. I almost wrote it as perl consuming the
output of "log --format=%h --name-only", but realized I didn't have a
perl ewah implementation handy.

You'll probably want to tweak this part:
quoted
+	prepare_revision_walk(&revs);
+	while ((commit = get_revision(&revs))) {
+		data.commit = commit;
+		diff_tree_combined_merge(commit, 0, &revs);
+	}
...to handle merges in a particular way. This will actually ignore
merges totally. You could add "-m" to the revision arguments to get a
per-parent diff, but of course you'd see those in your callback
individually. If you want to do _just_ the first parent diff, I think
you'll have to pick it apart manually, like:

  while ((commit = get_revision(&revs))) {
	struct object_id *parent_oid;

	/* ignore non-first parents, but handle root commits like --root */
	if (commit->parents)
		parent = &commit->parents->item->object.oid;
	else
		parent = the_hash_algo->empty_tree;

	diff_tree_oid(parent, &commit->oid, ...);
  }

-Peff
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help