Thread (27 messages) 27 messages, 4 authors, 2025-06-03

Re: [PATCH v2 2/2] midx: stop repeatedly looking up nonexistent packfiles

From: Taylor Blau <hidden>
Date: 2025-05-23 17:46:25
Subsystem: the rest · Maintainer: Linus Torvalds

Possibly related (same subject, not in this thread)

On Thu, May 22, 2025 at 10:08:20PM -0400, Jeff King wrote:
quoted
(As an aside, I think I recall you suggesting a while ago that it might
be interesting to define "global" things with a different type than
"local" ones to prevent this sort of confusion. That would allow us to
keep both "pack_int_id" and the return value of "midx_for_pack()" in
scope at the same time, without the possibility of using one when you
meant to use the other.)
Yeah. The trouble is that it becomes awkward in C, since the language
will happily intermix two integer typedefs. So you have to wrap them in
a struct and access the struct fields everywhere.
Yup :-<.
quoted
quoted
  - There's a similar case in midx-write.c:want_included_pack(). That
    one seems to have the same local/global confusion, but I do not
    obviously see anything preventing it from being fed a non-base midx.
    So it might possibly be buggy?
Yeah, this spot is definitely broken. At minimum it would need something
like:
--- 8< ---
diff --git a/midx-write.c b/midx-write.c
index 0897cbd829..54a04f7b75 100644
--- a/midx-write.c
+++ b/midx-write.c
@@ -1634,9 +1634,10 @@ static int want_included_pack(struct repository *r,
 			      uint32_t pack_int_id)
 {
 	struct packed_git *p;
+	midx_for_pack(&m, pack_int_id);
 	if (prepare_midx_pack(r, m, pack_int_id))
 		return 0;
-	p = m->packs[pack_int_id];
+	p = m->packs[pack_int_id - m->num_packs_in_base];
 	if (!pack_kept_objects && p->pack_keep)
 		return 0;
 	if (p->is_cruft)
--- >8 ---
Yep, that's exactly what I was envisioning. I guess it's probably
possible to trigger in practice by writing a new midx based on an
existing incremental state. I'll let you figure that part out. :)
Hmm. I thought that this spot was broken last night, but looking at it
again today I think that it is actually OK. I started writing an
analysis of why in response to your email, and then decided to throw it
away since those details really should live in the patch message. Here's
what I came up with:
--- 8< ---
Subject: [PATCH] midx-write.c: guard against incremental MIDXs in
 want_included_pack()

The function want_included_pack() is used to determine whether or not a
the packfile corresponding to some given pack_int_id should be included
in a 'git multi-pack-index repack' operation.

This spot looks like it would be broken, particularly in:

    struct packed_git *p;
    if (prepare_midx_pack(r, m, pack_int_id))
        return 0;
    p = m->packs[pack_int_id];

, when pack_int_id is greater than m->num_pack_in_base (supposing that
m->num_packs_in_base is non-zero, or equivalently that m->base_midx is
non-NULL).

Suppose we have two MIDXs in an incremental MIDX chain, each having two
packs:

  - M0 = {packs: [P0, P1], base_midx: NULL}
  - M1 = {packs: [P2, P3], base_midx: M0}

noting that each pack is identified by its global pack_int_id within the
chain.

So if you do something like:

    want_included_pack(the_repository, M1, pack_kept_objects, 2);

that function will call prepare_midx_pack(), which is smart enough to
realize that the pack of interest is in the current layer (M1), and
knows how to adjust its global pack_int_id into an index into the
current layer's 'packs' array.

But the following line:

    p = m->packs[pack_int_id]; /* m is M1, pack_int_id is 2 */

looks broken, since each layer of the MIDX only maintains an array of
the packs stored within that layer, and 'm' wasn't adjusted back to
point at M1->base_midx (M0).

The right thing to do would be:

    struct packed_git *p;
    if (prepare_midx_pack(r, m, pack_int_id))
        return 0;

    /* open-code midx.c::midx_for_pack(), which is private */
    while (m && pack_int_id < m->num_packs_in_base)
        m = m->base_midx;

    if (!m)
        BUG("broken midx?");
    if (pack_int_id >= m->num_packs + m->num_packs_in_base)
        BUG("broken pack_int_id?");

    p = m->packs[pack_int_id - m->num_packs_in_base];

But that would be overkill, since this function never deals with
incremental MIDXs having more than one layer! To see why, observe that
want_included_pack() has two callers:

  - midx-write.c::fill_included_packs_all()
  - midx-write.c::fill_included_packs_batch()

and those functions' sole caller is in midx-write.c::midx_repack(),
which dispatches a call to one or the other depending on whether or not
the batch_size is non-zero.

And at the very top of midx_repack(), we have a guard against
non-trivial incremental MIDX chains:

    if (m->base_midx)
        die(_("cannot repack an incremental multi-pack-index"));

So want_included_pack() is OK becuase it will never encounter a
situation where it has to chase backwards through the '->base_midx'
pointer. But that is not immediately clear from reading the code, and is
too fragile for my comfort. Make this more clear by adding an ASSERT()
to the above effect.

Signed-off-by: Taylor Blau <redacted>
---
 midx-write.c | 7 +++++++
 1 file changed, 7 insertions(+)
diff --git a/midx-write.c b/midx-write.c
index 0897cbd829..991c42d1dc 100644
--- a/midx-write.c
+++ b/midx-write.c
@@ -1634,6 +1634,9 @@ static int want_included_pack(struct repository *r,
 			      uint32_t pack_int_id)
 {
 	struct packed_git *p;
+
+	ASSERT(m && !m->base_midx);
+
 	if (prepare_midx_pack(r, m, pack_int_id))
 		return 0;
 	p = m->packs[pack_int_id];
@@ -1653,6 +1656,8 @@ static void fill_included_packs_all(struct repository *r,
 	uint32_t i;
 	int pack_kept_objects = 0;

+	ASSERT(m && !m->base_midx);
+
 	repo_config_get_bool(r, "repack.packkeptobjects", &pack_kept_objects);

 	for (i = 0; i < m->num_packs; i++) {
@@ -1673,6 +1678,8 @@ static void fill_included_packs_batch(struct repository *r,
 	struct repack_info *pack_info;
 	int pack_kept_objects = 0;

+	ASSERT(m && !m->base_midx);
+
 	CALLOC_ARRAY(pack_info, m->num_packs);

 	repo_config_get_bool(r, "repack.packkeptobjects", &pack_kept_objects);
--
2.49.0.221.g485f5f8636.dirty
--- >8 ---
quoted
quoted
    Likewise fill_included_packs_batch() in the same file.
I think this one is actually OK for the same reason as the
expire_midx_packs() case. Its sole caller in midx_repack() has:

    if (m->base_midx)
        die(_("cannot repack an incremental multi-pack-index"));

, so we are OK there. We might want to add an ASSERT() in
fill_included_packs_batch() to make it clearer, though.
Ah, good. Yes, I agree that an assertion would be a nice bit of
documentation/safety. Though for these cases I think they only care
about the pack (not the containing midx), so changing the return type of
prepare_midx_pack() might be an even easier way to get that safety.
Ironically, the same argument applies for this function (and the _all()
variant) as well ;-).

Thanks,
Taylor
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help