Re: [PATCH v2 7/8] csum-file: introduce hashfile_checkpoint_init()
From: Jeff King <hidden>
Date: 2025-01-18 12:15:16
Subsystem:
the rest · Maintainer:
Linus Torvalds
On Fri, Jan 17, 2025 at 04:30:05PM -0500, Taylor Blau wrote:
quoted hunk ↗ jump to hunk
diff --git a/bulk-checkin.c b/bulk-checkin.c index 433070a3bd..892176d23d 100644 --- a/bulk-checkin.c +++ b/bulk-checkin.c@@ -261,7 +261,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state, git_hash_ctx ctx; unsigned char obuf[16384]; unsigned header_len; - struct hashfile_checkpoint checkpoint = {0}; + struct hashfile_checkpoint checkpoint; struct pack_idx_entry *idx = NULL; seekback = lseek(fd, 0, SEEK_CUR);@@ -272,12 +272,15 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state, OBJ_BLOB, size); the_hash_algo->init_fn(&ctx); the_hash_algo->update_fn(&ctx, obuf, header_len); - the_hash_algo->unsafe_init_fn(&checkpoint.ctx); /* Note: idx is non-NULL when we are writing */ - if ((flags & HASH_WRITE_OBJECT) != 0) + if ((flags & HASH_WRITE_OBJECT) != 0) { CALLOC_ARRAY(idx, 1); + prepare_to_stream(state, flags); + hashfile_checkpoint_init(state->f, &checkpoint); + } + already_hashed_to = 0; while (1) {
Yeah, that's ugly. We are potentially throwing away the hashfile that the checkpoint was created for. That makes my instinct to push the checkpoint down into the loop where we we might restart a new pack, like this (and like you suggested below):
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 433070a3bd..efa59074fb 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c@@ -261,7 +261,6 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state, git_hash_ctx ctx; unsigned char obuf[16384]; unsigned header_len; - struct hashfile_checkpoint checkpoint = {0}; struct pack_idx_entry *idx = NULL; seekback = lseek(fd, 0, SEEK_CUR);
@@ -281,8 +280,10 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state, already_hashed_to = 0; while (1) { + struct hashfile_checkpoint checkpoint = {0}; prepare_to_stream(state, flags); if (idx) { + hashfile_checkpoint_init(state->f, &checkpoint); hashfile_checkpoint(state->f, &checkpoint); idx->offset = state->offset; crc32_begin(state->f);
but that doesn't work, because the checkpoint is also used later for the
already_written() check:
if (already_written(state, result_oid)) {
hashfile_truncate(state->f, &checkpoint);
state->offset = checkpoint.offset;
free(idx);
} else
That made me wonder if there is a bug lurking there. What if we found
the pack was too big, truncated to our checkpoint, and then opened a new
pack? Then the original checkpoint would now be bogus! It would mention
an offset in the original packfile which doesn't make any sense with
what we have open. But I think this is OK, because we can only leave the
loop when stream_blob_to_pack() returns, and we always establish a new
checkpoint before then.
So I do think that moving the initialization of the checkpoint into the
loop, but _not_ moving the variable would work the same way it does now
(i.e., what you suggested below).
But I admit that the way this loop works kind of makes my head spin. It
can really only ever run twice, but it is hard to see: we break out if
stream_blob_to_pack() returns success. And it will only return error if
we would bust the packsize limit (all other errors cause us to die()).
And only if we would bust the limit _and_ we are not the only object in
the pack. And since we start a new pack if we loop, that will never be
true on the second iteration; we'll always either die() or return
success.
I do think it would be much easier to read with a single explicit retry:
if (checkpoint_and_try_to_stream() < 0) {
/* we busted the limit; make a new pack and try again */
hashfile_truncate();
etc...
if (checkpoint_and_try_to_stream() < 0)
BUG("yikes, we should not fail a second time!");
}
where checkpoint_and_try_to_stream() is the first half of the loop, up
to the stream_blob_to_pack() call.
Anyway, that is all outside of your patch, and relevant only because
_if_ we untangled it a bit more, it might make the checkpoint lifetime a
bit more obvious and less scary to refactor.
But it does imply to me that the data dependency introduced by my
suggestion is not always so straight-forward as I thought it would be,
and we should probably punt on it for your series.
which would do the trick, but it feels awfully hacky to have the "if (checkpoint.f != state->f)" check in there, since that feels too intimately tied to the implementation of the hashfile_checkpoint API for my comfort.
I think you could unconditionally checkpoint at that part; we're about to do a write, so we want to store the state before the write in case we need to roll back.
Anyway, that's all to say that I think that while this is probably doable in theory, in practice it's kind of a mess, at least currently. I would rather see if there are other ways to clean up the deflate_blob_to_pack() function first in a way that made this change less awkward.
Yeah, I actually wrote what I wrote above before reading this far down in your email, but we arrived at the exact same conclusion. ;) Hopefully what I wrote might give some pointers if somebody wants to refactor later.
I think the most reasonable course here would be to pursue a minimal change like the one presented here and then think about further clean up as a separate step.
Yep. Thanks for looking into it. -Peff