[PATCH 00/17] midx: incremental MIDX/bitmap layer compaction
From: Taylor Blau <hidden>
Date: 2025-12-06 20:31:00
Note to the maintainer:
* This series is the spiritual successor to my earlier series queued
as 'tb/incremental-midx-part-3.1', but is based on 'master'
(currently f0ef5b6d9bc (The fifth batch, 2025-11-30) at the time of
writing). I suggest queueing as 'tb/incremental-midx-part-3.2'.
OVERVIEW
========
This series implements the second of three major components for an
incremental MIDX/bitmap-based repacking strategy. Those components
are:
1. Refactoring code out of builtin/repack.c into individual
compilation units outside of the builtin directory.
2. (this series) Implementing MIDX layer compaction, i.e. the ability
to take an contiguous sequence of MIDX layers in an existing MIDX
chain and replace them with a single layer representing the union
of objects/packs among the compacted layers.
3. An incremental repacking strategy that, unlike our current
'--geometric' approach, does not require periodic all-into-one
repacks.
BACKGROUND
==========
I'll briefly discuss that repacking strategy here in order to provide
some context on why MIDX layer compaction is important. When doing a
--geometric repack today, any update to the MIDX and/or its .bitmap
requires a full rewrite. For most repositories this is not such a big
deal, but in large monorepos these costs can add up significantly over
time.
To deal with this, incremental MIDXs (and corresponding support for
reachability bitmaps) were introduced as a way to append updates to
the end of a MIDX chain without having to rewrite anything.
Of course, growing a MIDX chain infinitely without ever shrinking it
would be pointless. So any repacking strategy which leverages
incremental MIDXs must take some action which encourages the MIDX
chain to avoid growing too large.
The strategy that I will share in the forthcoming topic related to
this effort does (roughly) the following:
1. Repack all non-MIDX'd packs geometrically, optionally including
any packs in the most recent MIDX layer iff it has more than some
threshold of packs.
2. Write out a new MIDX layer which is the result of the previous
step, adding it onto the end of the chain (or replacing the
previous end, depending on whether the pack threshold condition
was met).
3. Perform a compaction pass, combining adjacent MIDX layers to
restore a geometric progression on object count among MIDX layers
in the chain.
(As an aside, this description is not the most efficient way to
implement this behavior, and so the actual implementation varies
slightly, though achieves the same effect.)
This strategy encourages MIDX chains where older layers have fewer,
larger packs, and newer layers have a greater number of smaller packs.
The compaction pass prevents the MIDX chain from ever growing too
large. (In my simulation[1], even a large number of simulated pushes
yields a MIDX chain which is consistently fewer than ~10 or so layers
long.)
COMPACTION
==========
MIDX compaction works by producing a new MIDX layer which is the
logical union of any contiguous sequence within the existing MIDX
chain. Importantly, the packs/objects from those layers within the
compaction range have their order preserved. That means that the
pseudo-pack order of the compacted layer is guaranteed to be the same
as the concatenated pseudo-pack order across all layers in the
compaction range.
As a result, we can reuse any selected bitmaps within the compaction
range verbatim. More importantly, any bitmaps belonging to layers
which are descendants of the compacted one do not need to permute any
existing bitmaps in order for them to be valid post-compaction.
This allows us to cheaply combine parts of the MIDX chain without
needing to do any work outside of the combine layers, which is
essential in the above-described repacking strategy.
THIS SERIES
===========
This series implements MIDX compaction, and is organized roughly as
follows:
* The first two patches are stray code clean-up and minor refactoring
changes that I made while writing this series.
* The next three patches clean up the SYNOPSIS for in the manual page
for git-multi-pack-index(1), and removes the builtin from the list
of known failures in t0450.
* The next patch is a typofix, and the one following that is a
preemptive bugfix.
* The eighth patch is a code refactoring patch that makes the
argument list for midx-write.c::write_midx_internal() easier to
reason about.
* The ninth patch relaxes a MIDX constraint that forces all packs
within a layer to be sorted lexicographically.
* The remaining patches prepare for and introduce MIDX compaction.
The test coverage in the new t5335-compact-multi-pack-index.sh is
fairly sparse in my opinion, though most of the interesting behavior
is already exercised via t5319 and t5334. If others have ideas on how
to improve test coverage here, please let me know!
This series is somewhat lengthy, though the individual changes are
relatively small for the most part, so I am hoping that they aren't
too bad to review.
Thanks in advance for your review!
[1]: https://gist.github.com/ttaylorr/7ac0305f130197744f3583096c8198a1
Taylor Blau (17):
midx: mark `get_midx_checksum()` arguments as const
midx: split `get_midx_checksum()` by adding `get_midx_hash()`
builtin/multi-pack-index.c: make '--progress' a common option
git-multi-pack-index(1): remove non-existent incompatibility
git-multi-pack-index(1): align SYNOPSIS with 'git multi-pack-index -h'
t/t5319-multi-pack-index.sh: fix copy-and-paste error in t5319.39
midx-write.c: don't use `pack_perm` when assigning `bitmap_pos`
midx-write.c: introduce `struct write_midx_opts`
midx: do not require packs to be sorted in lexicographic order
git-compat-util.h: introduce `u32_add()`
midx-write.c: introduce `midx_pack_perm()` helper
midx-write.c: extract `fill_pack_from_midx()`
midx-write.c: enumerate `pack_int_id` values directly
midx-write.c: factor fanout layering from `compute_sorted_entries()`
t/helper/test-read-midx.c: plug memory leak when selecting layer
midx: implement MIDX compaction
midx: enable reachability bitmaps during MIDX compaction
Documentation/git-multi-pack-index.adoc | 24 +-
builtin/multi-pack-index.c | 84 ++++-
git-compat-util.h | 8 +
midx-write.c | 463 ++++++++++++++++++------
midx.c | 36 +-
midx.h | 9 +-
pack-bitmap.c | 9 +-
pack-revindex.c | 4 +-
t/helper/test-read-midx.c | 21 +-
t/meson.build | 1 +
t/t0450/adoc-help-mismatches | 1 -
t/t5319-multi-pack-index.sh | 7 +-
t/t5335-compact-multi-pack-index.sh | 218 +++++++++++
13 files changed, 737 insertions(+), 148 deletions(-)
create mode 100755 t/t5335-compact-multi-pack-index.sh
base-commit: f0ef5b6d9bcc258e4cbef93839d1b7465d5212b9
--
2.52.0.171.gd6a4e6b6955