Thread (55 messages) 55 messages, 4 authors, 2025-01-23

Re: [PATCH v2 7/8] csum-file: introduce hashfile_checkpoint_init()

From: Jeff King <hidden>
Date: 2025-01-18 12:15:16
Subsystem: the rest · Maintainer: Linus Torvalds

On Fri, Jan 17, 2025 at 04:30:05PM -0500, Taylor Blau wrote:
quoted hunk ↗ jump to hunk
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 433070a3bd..892176d23d 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -261,7 +261,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	git_hash_ctx ctx;
 	unsigned char obuf[16384];
 	unsigned header_len;
-	struct hashfile_checkpoint checkpoint = {0};
+	struct hashfile_checkpoint checkpoint;
 	struct pack_idx_entry *idx = NULL;

 	seekback = lseek(fd, 0, SEEK_CUR);
@@ -272,12 +272,15 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 					  OBJ_BLOB, size);
 	the_hash_algo->init_fn(&ctx);
 	the_hash_algo->update_fn(&ctx, obuf, header_len);
-	the_hash_algo->unsafe_init_fn(&checkpoint.ctx);

 	/* Note: idx is non-NULL when we are writing */
-	if ((flags & HASH_WRITE_OBJECT) != 0)
+	if ((flags & HASH_WRITE_OBJECT) != 0) {
 		CALLOC_ARRAY(idx, 1);

+		prepare_to_stream(state, flags);
+		hashfile_checkpoint_init(state->f, &checkpoint);
+	}
+
 	already_hashed_to = 0;

 	while (1) {
Yeah, that's ugly. We are potentially throwing away the hashfile that
the checkpoint was created for. That makes my instinct to push the
checkpoint down into the loop where we we might restart a new pack, like
this (and like you suggested below):
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 433070a3bd..efa59074fb 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -261,7 +261,6 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	git_hash_ctx ctx;
 	unsigned char obuf[16384];
 	unsigned header_len;
-	struct hashfile_checkpoint checkpoint = {0};
 	struct pack_idx_entry *idx = NULL;
 
 	seekback = lseek(fd, 0, SEEK_CUR);
@@ -281,8 +280,10 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	already_hashed_to = 0;
 
 	while (1) {
+		struct hashfile_checkpoint checkpoint = {0};
 		prepare_to_stream(state, flags);
 		if (idx) {
+			hashfile_checkpoint_init(state->f, &checkpoint);
 			hashfile_checkpoint(state->f, &checkpoint);
 			idx->offset = state->offset;
 			crc32_begin(state->f);
but that doesn't work, because the checkpoint is also used later for the
already_written() check:

        if (already_written(state, result_oid)) {
                hashfile_truncate(state->f, &checkpoint);
                state->offset = checkpoint.offset;
                free(idx);
	} else

That made me wonder if there is a bug lurking there. What if we found
the pack was too big, truncated to our checkpoint, and then opened a new
pack? Then the original checkpoint would now be bogus! It would mention
an offset in the original packfile which doesn't make any sense with
what we have open. But I think this is OK, because we can only leave the
loop when stream_blob_to_pack() returns, and we always establish a new
checkpoint before then.

So I do think that moving the initialization of the checkpoint into the
loop, but _not_ moving the variable would work the same way it does now
(i.e., what you suggested below).

But I admit that the way this loop works kind of makes my head spin. It
can really only ever run twice, but it is hard to see: we break out if
stream_blob_to_pack() returns success. And it will only return error if
we would bust the packsize limit (all other errors cause us to die()).
And only if we would bust the limit _and_ we are not the only object in
the pack. And since we start a new pack if we loop, that will never be
true on the second iteration; we'll always either die() or return
success.

I do think it would be much easier to read with a single explicit retry:

  if (checkpoint_and_try_to_stream() < 0) {
	/* we busted the limit; make a new pack and try again */
	hashfile_truncate();
	etc...
        if (checkpoint_and_try_to_stream() < 0)
		BUG("yikes, we should not fail a second time!");
  }

where checkpoint_and_try_to_stream() is the first half of the loop, up
to the stream_blob_to_pack() call.

Anyway, that is all outside of your patch, and relevant only because
_if_ we untangled it a bit more, it might make the checkpoint lifetime a
bit more obvious and less scary to refactor.

But it does imply to me that the data dependency introduced by my
suggestion is not always so straight-forward as I thought it would be,
and we should probably punt on it for your series.
which would do the trick, but it feels awfully hacky to have the "if
(checkpoint.f != state->f)" check in there, since that feels too
intimately tied to the implementation of the hashfile_checkpoint API for
my comfort.
I think you could unconditionally checkpoint at that part; we're about
to do a write, so we want to store the state before the write in case we
need to roll back.
Anyway, that's all to say that I think that while this is probably
doable in theory, in practice it's kind of a mess, at least currently.
I would rather see if there are other ways to clean up the
deflate_blob_to_pack() function first in a way that made this change
less awkward.
Yeah, I actually wrote what I wrote above before reading this far down
in your email, but we arrived at the exact same conclusion. ;) Hopefully
what I wrote might give some pointers if somebody wants to refactor
later.
I think the most reasonable course here would be to pursue a minimal
change like the one presented here and then think about further clean up
as a separate step.
Yep. Thanks for looking into it.

-Peff
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help