Thread (17 messages) 17 messages, 5 authors, 2022-10-31

Re: [PATCH 3/5] roaring: teach Git to write roaring bitmaps

From: Junio C Hamano <hidden>
Date: 2022-09-30 06:21:04

"Abhradeep Chakraborty via GitGitGadget" [off-list ref]
writes:
From: Abhradeep Chakraborty <redacted>

Roaring bitmaps are said to be more efficient (most of the time) than
ewah bitmaps. So Git might gain some optimization if it support roaring
bitmaps. As Roaring library has all the changes it needed to implement
roaring bitmaps in Git, Git can learn to write roaring bitmaps. However,
all the changes are backward-compatible.

Teach Git to write roaring bitmaps.
That is way underexplained.   At least cover what the plans are, so
that readers do not have to ask these questions:

 * When is the choice of bitmap type is made?  Is it fixed at
   repository initialization time and once chosen other kinds cannot
   be used?

 * Is the bitmap file self describing?  How does a reader know
   between ewah and roaring codepaths to use to read a given bitmap
   file?  Is there enough room for extending the set of bitmap
   formats, or we cannot add other formats easily?
quoted hunk ↗ jump to hunk
Mentored-by: Taylor Blau [off-list ref]
Mentored-by: Kaartic Sivaraam [off-list ref]
Signed-off-by: Abhradeep Chakraborty <redacted>
---
 Makefile                |   1 +
 bitmap.c                | 225 +++++++++++++++++++++++++++
 bitmap.h                |  33 ++++
 builtin/diff.c          |  10 +-
 ewah/bitmap.c           |  61 +++++---
 ewah/ewok.h             |  37 ++---
 pack-bitmap-write.c     | 326 ++++++++++++++++++++++++++++++----------
 pack-bitmap.c           | 114 +++++++-------
 pack-bitmap.h           |  22 ++-
 t/t5310-pack-bitmaps.sh |  17 +++
 10 files changed, 664 insertions(+), 182 deletions(-)
 create mode 100644 bitmap.c
 create mode 100644 bitmap.h
diff --git a/Makefile b/Makefile
index e9537951105..9ca19b3ca8d 100644
--- a/Makefile
+++ b/Makefile
@@ -900,6 +900,7 @@ LIB_OBJS += archive.o
 LIB_OBJS += attr.o
 LIB_OBJS += base85.o
 LIB_OBJS += bisect.o
+LIB_OBJS += bitmap.o
 LIB_OBJS += blame.o
 LIB_OBJS += blob.o
 LIB_OBJS += bloom.o
diff --git a/bitmap.c b/bitmap.c
new file mode 100644
index 00000000000..7d547eb9f53
--- /dev/null
+++ b/bitmap.c
@@ -0,0 +1,225 @@
+#include "bitmap.h"
+#include "cache.h"
+
+static enum bitmap_type bitmap_type = INIT_BITMAP_TYPE;
"INIT" is a strange name for "UNINITIALIZED".  Especially ...
+void *roaring_or_ewah_bitmap_init(void)
+{
+	switch (bitmap_type)
+	{
(Style)
+	case EWAH:
+		return ewah_new();
+	case ROARING:
+		return roaring_bitmap_create();
+	default:
... here, you use it to mean exactly that.
+		error(_("bitmap type not initialized\n"));
+		return NULL;
Do you really need the global variable that holds the bitmap type?

Wouldn't it be easier to write code that needs to deal with both
types (e.g. in a repository with existing ewah bitmap, you want to
do a repack and index the result using the roaring bitmap) if you
passed the type through the callchain as a parameter?

It may be that the codepath that reads from an existing bitmap file
says "ah, the file given to us seems to be in format X (either EWAH
or ROARING or perhaps something else), so let's call bitmap_init(X)
to obtain the in-core data structure to deal with that file".  When
that happens, you may probably need to have two cases in the default:
arm of this switch statement, i.e. one to diagnose a BUG() to pass
an uninitialized bitmap type to the codepath, and the other to
diagnose a runtime error() to have read a bitmap file whose format
this version of Git does not understand.
+void *roaring_or_raw_bitmap_copy(void *bitmap)
+{
+	switch (bitmap_type)
+	{
+	case EWAH:
...
+int roaring_or_ewah_bitmap_set(void *bitmap, uint32_t i)
+{
+	switch (bitmap_type) {
+	case EWAH:
+...
+void roaring_or_raw_bitmap_set(void *bitmap, uint32_t i)
+{
+	switch (bitmap_type)
+	{
+	case EWAH:
+...
+void roaring_or_raw_bitmap_unset(void *bitmap, uint32_t i)
+{
+	switch (bitmap_type)
+	{
+	case EWAH:
+...
These repetitive patterns makes me wonder if void *bitmap
is a good type to be passing around.  Shouldn't it be a struct with
its first member being a bitmap_type, and another member being what
these functions are passing to the underlying bitmap format specific
functions as "bitmap"?  E.g.

    void bitmap_unset(struct bitmap *bm, uint32_t i)
    {
	switch (bm->type) {
	case EWAH:
		ewah_bitmap_remove(bm->u.ewah, i);
		break;
	...

\ No newline at end of file
Careful.
quoted hunk ↗ jump to hunk
diff --git a/bitmap.h b/bitmap.h
new file mode 100644
index 00000000000..d75400922cc
--- /dev/null
+++ b/bitmap.h
@@ -0,0 +1,33 @@
+#ifndef __BITMAP_H__
+#define __BITMAP_H__
+
+
+#include "git-compat-util.h"
+#include "ewah/ewok.h"
+#include "roaring/roaring.h"
+
+enum bitmap_type {
+	INIT_BITMAP_TYPE = 0,
"UNINITIALIZED_BITMAP_TYPE", probably.
+void *roaring_or_ewah_bitmap_init(void);
I would strongly suggest reconsider these names.  What if you later
want to add the third variant?  roaring_or_ewah_or_xyzzy_bitmap_init()?

Instead just use the most generic name, like "bitmap_init", perhaps
something along the lines of ...

    struct bitmap {
	enum bitmap_type type;
	union {
	    struct ewah_bitmap *ewah;
	    struct roaring_bitmap *roaring;
	} u;	
    };

    struct bitmap *bitmap_new(enum bitmap_type type)
    {
	struct bitmap *bm = xmalloc(sizeof(*bm));

	bm->type = type;
	switch (bm->type) {
	case EWAH:
	    bm->u.ewah = ewah_new();
	    break;
	case ROARING:
	    bm->u.roaring = roaring_bitmap_create();
	    break;
        default:
	    die(_("unknown bitmap type %d"), (int)type);
	}
	return bm;
    }

I dunno.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help