Re: [PATCH 3/5] roaring: teach Git to write roaring bitmaps
From: Abhradeep Chakraborty <hidden>
Date: 2022-09-30 16:24:03
On Fri, Sep 30, 2022 at 11:51 AM Junio C Hamano [off-list ref] wrote:
"Abhradeep Chakraborty via GitGitGadget" [off-list ref] writes:quoted
From: Abhradeep Chakraborty <redacted> Roaring bitmaps are said to be more efficient (most of the time) than ewah bitmaps. So Git might gain some optimization if it support roaring bitmaps. As Roaring library has all the changes it needed to implement roaring bitmaps in Git, Git can learn to write roaring bitmaps. However, all the changes are backward-compatible. Teach Git to write roaring bitmaps.That is way underexplained. At least cover what the plans are, so that readers do not have to ask these questions: * When is the choice of bitmap type is made? Is it fixed at repository initialization time and once chosen other kinds cannot be used? * Is the bitmap file self describing? How does a reader know between ewah and roaring codepaths to use to read a given bitmap file? Is there enough room for extending the set of bitmap formats, or we cannot add other formats easily?
Hey Junio,
First of all, sorry that the next version is taking so much time to
land. We have a festival ("Durga Puja"; it is the biggest festival for
Bengalis) going on here now. So I am not that active.
I will explain briefly in the next version.
Do you really need the global variable that holds the bitmap type? Wouldn't it be easier to write code that needs to deal with both types (e.g. in a repository with existing ewah bitmap, you want to do a repack and index the result using the roaring bitmap) if you passed the type through the callchain as a parameter?
I didn't want to go for "passing the type through the callchain as a parameter" because that would cause changes to every affected function definition. I found the "global variable" approach simpler for this reason. Here we have to initialize the type once and the affected functions will work accordingly. If you like the "callchain" approach, I have no problem to implement it.
It may be that the codepath that reads from an existing bitmap file says "ah, the file given to us seems to be in format X (either EWAH or ROARING or perhaps something else), so let's call bitmap_init(X) to obtain the in-core data structure to deal with that file". When that happens, you may probably need to have two cases in the default: arm of this switch statement, i.e. one to diagnose a BUG() to pass an uninitialized bitmap type to the codepath, and the other to diagnose a runtime error() to have read a bitmap file whose format this version of Git does not understand.
Ok, understood. Thanks.
These repetitive patterns makes me wonder if void *bitmap
is a good type to be passing around. Shouldn't it be a struct with
its first member being a bitmap_type, and another member being what
these functions are passing to the underlying bitmap format specific
functions as "bitmap"? E.g.
void bitmap_unset(struct bitmap *bm, uint32_t i)
{
switch (bm->type) {
case EWAH:
ewah_bitmap_remove(bm->u.ewah, i);
break;
...Good idea! Thanks.
quoted
+ +enum bitmap_type { + INIT_BITMAP_TYPE = 0,"UNINITIALIZED_BITMAP_TYPE", probably.
Ok.
quoted
+void *roaring_or_ewah_bitmap_init(void);I would strongly suggest reconsider these names. What if you later want to add the third variant? roaring_or_ewah_or_xyzzy_bitmap_init()? Instead just use the most generic name, like "bitmap_init", perhaps something along the lines of ... struct bitmap { enum bitmap_type type; union { struct ewah_bitmap *ewah; struct roaring_bitmap *roaring; } u; }; struct bitmap *bitmap_new(enum bitmap_type type) { struct bitmap *bm = xmalloc(sizeof(*bm)); bm->type = type; switch (bm->type) { case EWAH: bm->u.ewah = ewah_new(); break; case ROARING: bm->u.roaring = roaring_bitmap_create(); break; default: die(_("unknown bitmap type %d"), (int)type); } return bm; }
Got it. It seems a better option than the current one. Thanks )