Re: Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph)
From: Jeff King <hidden>
Date: 2018-10-09 21:14:53
On Tue, Oct 09, 2018 at 03:03:08PM -0400, Derrick Stolee wrote:
quoted
I wonder if Roaring does better here.In these sparse cases, usually Roaring will organize the data as "array chunks" which are simply lists of the values. The thing that makes this still compressible is that we store two bytes per entry, as the entries are grouped by a common most-significant two bytes. SInce you say ~120k unique paths, the Roaring bitmap would have two or three chunks per bitmap (and those chunks could be empty). The overhead to store the chunk positions, types, and lengths does come at a cost, but it's more like 32 bytes _per commit_.
Hmph. It really sounds like we could do better with a custom RLE solution. But that makes me feel like I'm missing something, because surely I can't invent something better than the state of the art in a simple thought experiment, right? I know what I'm proposing would be quite bad for random access, but my impression is that EWAH is the same. For the scale of bitmaps we're talking about, I think linear/streaming access through the bitmap would be OK.
quoted
So at any rate, I do think it would not be out of the question to store bitmaps like this. I'm much more worried about the maintenance cost of adding new entries incrementally. I think it's only feasible if we give up sorting, and then I wonder what other problems that might cause.The patch below gives me a starting point to try the Bloom filter approach and see what the numbers are like. You did all the "git" stuff like computing the changed paths, so thanks!
Great, I hope it can be useful. I almost wrote it as perl consuming the output of "log --format=%h --name-only", but realized I didn't have a perl ewah implementation handy. You'll probably want to tweak this part:
quoted
+ prepare_revision_walk(&revs); + while ((commit = get_revision(&revs))) { + data.commit = commit; + diff_tree_combined_merge(commit, 0, &revs); + }
...to handle merges in a particular way. This will actually ignore
merges totally. You could add "-m" to the revision arguments to get a
per-parent diff, but of course you'd see those in your callback
individually. If you want to do _just_ the first parent diff, I think
you'll have to pick it apart manually, like:
while ((commit = get_revision(&revs))) {
struct object_id *parent_oid;
/* ignore non-first parents, but handle root commits like --root */
if (commit->parents)
parent = &commit->parents->item->object.oid;
else
parent = the_hash_algo->empty_tree;
diff_tree_oid(parent, &commit->oid, ...);
}
-Peff