Re: [PATCH 3/5] roaring: teach Git to write roaring bitmaps
From: Junio C Hamano <hidden>
Date: 2022-09-30 06:21:04
"Abhradeep Chakraborty via GitGitGadget" [off-list ref] writes:
From: Abhradeep Chakraborty <redacted> Roaring bitmaps are said to be more efficient (most of the time) than ewah bitmaps. So Git might gain some optimization if it support roaring bitmaps. As Roaring library has all the changes it needed to implement roaring bitmaps in Git, Git can learn to write roaring bitmaps. However, all the changes are backward-compatible. Teach Git to write roaring bitmaps.
That is way underexplained. At least cover what the plans are, so that readers do not have to ask these questions: * When is the choice of bitmap type is made? Is it fixed at repository initialization time and once chosen other kinds cannot be used? * Is the bitmap file self describing? How does a reader know between ewah and roaring codepaths to use to read a given bitmap file? Is there enough room for extending the set of bitmap formats, or we cannot add other formats easily?
quoted hunk ↗ jump to hunk
Mentored-by: Taylor Blau [off-list ref] Mentored-by: Kaartic Sivaraam [off-list ref] Signed-off-by: Abhradeep Chakraborty <redacted> --- Makefile | 1 + bitmap.c | 225 +++++++++++++++++++++++++++ bitmap.h | 33 ++++ builtin/diff.c | 10 +- ewah/bitmap.c | 61 +++++--- ewah/ewok.h | 37 ++--- pack-bitmap-write.c | 326 ++++++++++++++++++++++++++++++---------- pack-bitmap.c | 114 +++++++------- pack-bitmap.h | 22 ++- t/t5310-pack-bitmaps.sh | 17 +++ 10 files changed, 664 insertions(+), 182 deletions(-) create mode 100644 bitmap.c create mode 100644 bitmap.hdiff --git a/Makefile b/Makefile index e9537951105..9ca19b3ca8d 100644 --- a/Makefile +++ b/Makefile@@ -900,6 +900,7 @@ LIB_OBJS += archive.o LIB_OBJS += attr.o LIB_OBJS += base85.o LIB_OBJS += bisect.o +LIB_OBJS += bitmap.o LIB_OBJS += blame.o LIB_OBJS += blob.o LIB_OBJS += bloom.odiff --git a/bitmap.c b/bitmap.c new file mode 100644 index 00000000000..7d547eb9f53 --- /dev/null +++ b/bitmap.c@@ -0,0 +1,225 @@ +#include "bitmap.h" +#include "cache.h" + +static enum bitmap_type bitmap_type = INIT_BITMAP_TYPE;
"INIT" is a strange name for "UNINITIALIZED". Especially ...
+void *roaring_or_ewah_bitmap_init(void)
+{
+ switch (bitmap_type)
+ {(Style)
+ case EWAH: + return ewah_new(); + case ROARING: + return roaring_bitmap_create(); + default:
... here, you use it to mean exactly that.
+ error(_("bitmap type not initialized\n"));
+ return NULL;Do you really need the global variable that holds the bitmap type? Wouldn't it be easier to write code that needs to deal with both types (e.g. in a repository with existing ewah bitmap, you want to do a repack and index the result using the roaring bitmap) if you passed the type through the callchain as a parameter? It may be that the codepath that reads from an existing bitmap file says "ah, the file given to us seems to be in format X (either EWAH or ROARING or perhaps something else), so let's call bitmap_init(X) to obtain the in-core data structure to deal with that file". When that happens, you may probably need to have two cases in the default: arm of this switch statement, i.e. one to diagnose a BUG() to pass an uninitialized bitmap type to the codepath, and the other to diagnose a runtime error() to have read a bitmap file whose format this version of Git does not understand.
+void *roaring_or_raw_bitmap_copy(void *bitmap)
+{
+ switch (bitmap_type)
+ {
+ case EWAH:
...
+int roaring_or_ewah_bitmap_set(void *bitmap, uint32_t i)
+{
+ switch (bitmap_type) {
+ case EWAH:
+...
+void roaring_or_raw_bitmap_set(void *bitmap, uint32_t i)
+{
+ switch (bitmap_type)
+ {
+ case EWAH:
+...
+void roaring_or_raw_bitmap_unset(void *bitmap, uint32_t i)
+{
+ switch (bitmap_type)
+ {
+ case EWAH:
+...
These repetitive patterns makes me wonder if void *bitmap
is a good type to be passing around. Shouldn't it be a struct with
its first member being a bitmap_type, and another member being what
these functions are passing to the underlying bitmap format specific
functions as "bitmap"? E.g.
void bitmap_unset(struct bitmap *bm, uint32_t i)
{
switch (bm->type) {
case EWAH:
ewah_bitmap_remove(bm->u.ewah, i);
break;
...
\ No newline at end of file
Careful.
quoted hunk ↗ jump to hunk
diff --git a/bitmap.h b/bitmap.h new file mode 100644 index 00000000000..d75400922cc --- /dev/null +++ b/bitmap.h@@ -0,0 +1,33 @@ +#ifndef __BITMAP_H__ +#define __BITMAP_H__ + + +#include "git-compat-util.h" +#include "ewah/ewok.h" +#include "roaring/roaring.h" + +enum bitmap_type { + INIT_BITMAP_TYPE = 0,
"UNINITIALIZED_BITMAP_TYPE", probably.
+void *roaring_or_ewah_bitmap_init(void);
I would strongly suggest reconsider these names. What if you later
want to add the third variant? roaring_or_ewah_or_xyzzy_bitmap_init()?
Instead just use the most generic name, like "bitmap_init", perhaps
something along the lines of ...
struct bitmap {
enum bitmap_type type;
union {
struct ewah_bitmap *ewah;
struct roaring_bitmap *roaring;
} u;
};
struct bitmap *bitmap_new(enum bitmap_type type)
{
struct bitmap *bm = xmalloc(sizeof(*bm));
bm->type = type;
switch (bm->type) {
case EWAH:
bm->u.ewah = ewah_new();
break;
case ROARING:
bm->u.roaring = roaring_bitmap_create();
break;
default:
die(_("unknown bitmap type %d"), (int)type);
}
return bm;
}
I dunno.