Thread (17 messages) 17 messages, 5 authors, 2022-10-31

Re: [PATCH 0/5] [RFC] introduce Roaring bitmaps to Git

From: Derrick Stolee <hidden>
Date: 2022-09-19 18:18:24

On 9/19/2022 1:47 PM, Abhradeep Chakraborty via GitGitGadget wrote:
Git currently uses ewah bitmaps ( which are based on run-length encoding) to
compress bitmaps. Ewah bitmaps stores bitmaps in the form of run-length
words i.e. instead of storing each and every bit, it tries to find
consecutive bits (having same value) and replace them with the value bit and
the range upto which the bit is present. It is simple and efficient. But one
downside of this approach is that we have to decompress the whole bitmap in
order to find the bit of a certain position.

For small (or medium sized) bitmaps, this is not an issue. But it can be an
issue for large (or extra large) bitmaps. In that case roaring bitmaps are
generally more efficient[1] than ewah itself. Some benchmarks suggests that
roaring bitmaps give more performance benefits than ewah or any other
similar compression technique.

This patch series is currently in RFC state and it aims to let Git use
roaring bitmaps. As this is an RFC patch series (for now), the code are not
fully accurate (i.e. some tests are failing). But it is backward-compatible
(tests related to ewah bitmaps are passing). Some commit messages might need
more explanation and some commits may need a split (specially the one that
implement writing roaring bitmaps). Overall, the structure and code are near
to ready to make the series a formal patch series.

I am submitting it as an RFC (after discussions with mentors) because the
GSoC coding period is about to end. I will continue to work on the patch
series.
I look forward to your next version. I hope to see some information about
the performance characteristics across the two versions. Specifically:

1. How do various test in t/perf/ change between the two formats?
2. For certain test repos (git/git, torvalds/linux, etc.) how much does
   the .bitmap file change in size across the formats?
 
 Makefile                   |     3 +
 bitmap.c                   |   225 +
 bitmap.h                   |    33 +
...
 ewah/bitmap.c              |    61 +-
 ewah/ewok.h                |    37 +-
...
 roaring/roaring.c          | 20047 +++++++++++++++++++++++++++++++++++
 roaring/roaring.h          |  1028 ++
I wonder if there is value in modifying the structure of these files
into a bitmap/ directory and then perhaps ewah/ and roaring/ within
each? Just a thought.

Thanks,
-Stolee
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help