Re: [PATCH v2 7/8] csum-file: introduce hashfile_checkpoint_init()

[PATCH 0/6] hash: introduce unsafe_hash_algo(), drop unsafe_ variants · Taylor Blau <hidden> · 2024-11-20
[PATCH 1/6] csum-file: store the hash algorithm as a struct field · Taylor Blau <hidden> · 2024-11-20
Re: [PATCH 1/6] csum-file: store the hash algorithm as a struct field · Jeff King <hidden> · 2024-11-21
[PATCH 2/6] csum-file.c: extract algop from hashfile_checksum_valid() · Taylor Blau <hidden> · 2024-11-20
[PATCH 3/6] hash.h: introduce `unsafe_hash_algo()` · Taylor Blau <hidden> · 2024-11-20
Re: [PATCH 3/6] hash.h: introduce `unsafe_hash_algo()` · Jeff King <hidden> · 2024-11-21
Re: [PATCH 3/6] hash.h: introduce `unsafe_hash_algo()` · brian m. carlson <hidden> · 2024-11-22
Re: [PATCH 3/6] hash.h: introduce `unsafe_hash_algo()` · Jeff King <hidden> · 2024-11-22
Re: [PATCH 3/6] hash.h: introduce `unsafe_hash_algo()` · brian m. carlson <hidden> · 2024-11-22
Re: [PATCH 3/6] hash.h: introduce `unsafe_hash_algo()` · Taylor Blau <hidden> · 2025-01-10
Re: [PATCH 3/6] hash.h: introduce `unsafe_hash_algo()` · Jeff King <hidden> · 2025-01-11
[PATCH 4/6] csum-file.c: use unsafe_hash_algo() · Taylor Blau <hidden> · 2024-11-20
[PATCH 5/6] t/helper/test-hash.c: use unsafe_hash_algo() · Taylor Blau <hidden> · 2024-11-20
[PATCH 6/6] hash.h: drop unsafe_ function variants · Taylor Blau <hidden> · 2024-11-20
Re: [PATCH 6/6] hash.h: drop unsafe_ function variants · Jeff King <hidden> · 2024-11-21
[PATCH v2 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants · Taylor Blau <hidden> · 2025-01-08
[PATCH v2 1/8] t/helper/test-tool: implement sha1-unsafe helper · Taylor Blau <hidden> · 2025-01-08
[PATCH v2 2/8] csum-file: store the hash algorithm as a struct field · Taylor Blau <hidden> · 2025-01-08
Re: [PATCH v2 2/8] csum-file: store the hash algorithm as a struct field · Patrick Steinhardt <hidden> · 2025-01-16
Re: [PATCH v2 2/8] csum-file: store the hash algorithm as a struct field · Taylor Blau <hidden> · 2025-01-17
[PATCH v2 3/8] csum-file.c: extract algop from hashfile_checksum_valid() · Taylor Blau <hidden> · 2025-01-08
[PATCH v2 4/8] hash.h: introduce `unsafe_hash_algo()` · Taylor Blau <hidden> · 2025-01-08
Re: [PATCH v2 4/8] hash.h: introduce `unsafe_hash_algo()` · Patrick Steinhardt <hidden> · 2025-01-16
Re: [PATCH v2 4/8] hash.h: introduce `unsafe_hash_algo()` · Taylor Blau <hidden> · 2025-01-17
[PATCH v2 5/8] csum-file.c: use unsafe_hash_algo() · Taylor Blau <hidden> · 2025-01-08
[PATCH v2 6/8] t/helper/test-hash.c: use unsafe_hash_algo() · Taylor Blau <hidden> · 2025-01-08
[PATCH v2 7/8] csum-file: introduce hashfile_checkpoint_init() · Taylor Blau <hidden> · 2025-01-08
Re: [PATCH v2 7/8] csum-file: introduce hashfile_checkpoint_init() · Jeff King <hidden> · 2025-01-10
Re: [PATCH v2 7/8] csum-file: introduce hashfile_checkpoint_init() · Taylor Blau <hidden> · 2025-01-10
Re: [PATCH v2 7/8] csum-file: introduce hashfile_checkpoint_init() · Taylor Blau <hidden> · 2025-01-17
Re: [PATCH v2 7/8] csum-file: introduce hashfile_checkpoint_init() · Jeff King <hidden> · 2025-01-18
[PATCH v2 8/8] hash.h: drop unsafe_ function variants · Taylor Blau <hidden> · 2025-01-08
Re: [PATCH v2 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants · Jeff King <hidden> · 2025-01-10
Re: [PATCH v2 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants · Taylor Blau <hidden> · 2025-01-10
Re: [PATCH v2 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants · Jeff King <hidden> · 2025-01-11
[PATCH v3 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants · Taylor Blau <hidden> · 2025-01-17
[PATCH v3 1/8] t/helper/test-tool: implement sha1-unsafe helper · Taylor Blau <hidden> · 2025-01-17
[PATCH v3 2/8] csum-file: store the hash algorithm as a struct field · Taylor Blau <hidden> · 2025-01-17
[PATCH v3 3/8] csum-file.c: extract algop from hashfile_checksum_valid() · Taylor Blau <hidden> · 2025-01-17
[PATCH v3 4/8] hash.h: introduce `unsafe_hash_algo()` · Taylor Blau <hidden> · 2025-01-17
[PATCH v3 5/8] csum-file.c: use unsafe_hash_algo() · Taylor Blau <hidden> · 2025-01-17
[PATCH v3 6/8] t/helper/test-hash.c: use unsafe_hash_algo() · Taylor Blau <hidden> · 2025-01-17
[PATCH v3 7/8] csum-file: introduce hashfile_checkpoint_init() · Taylor Blau <hidden> · 2025-01-17
[PATCH v3 8/8] hash.h: drop unsafe_ function variants · Taylor Blau <hidden> · 2025-01-17
Re: [PATCH v3 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants · Jeff King <hidden> · 2025-01-18
Re: [PATCH v3 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants · Jeff King <hidden> · 2025-01-18
[PATCH v4 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants · Taylor Blau <hidden> · 2025-01-23
[PATCH v4 1/8] t/helper/test-tool: implement sha1-unsafe helper · Taylor Blau <hidden> · 2025-01-23
[PATCH v4 2/8] csum-file: store the hash algorithm as a struct field · Taylor Blau <hidden> · 2025-01-23
[PATCH v4 3/8] csum-file.c: extract algop from hashfile_checksum_valid() · Taylor Blau <hidden> · 2025-01-23
[PATCH v4 4/8] hash.h: introduce `unsafe_hash_algo()` · Taylor Blau <hidden> · 2025-01-23
[PATCH v4 5/8] csum-file.c: use unsafe_hash_algo() · Taylor Blau <hidden> · 2025-01-23
[PATCH v4 6/8] t/helper/test-hash.c: use unsafe_hash_algo() · Taylor Blau <hidden> · 2025-01-23
[PATCH v4 7/8] csum-file: introduce hashfile_checkpoint_init() · Taylor Blau <hidden> · 2025-01-23
[PATCH v4 8/8] hash.h: drop unsafe_ function variants · Taylor Blau <hidden> · 2025-01-23

From: Jeff King <hidden>
Date: 2025-01-18 12:15:16
Subsystem: the rest · Maintainer: Linus Torvalds

On Fri, Jan 17, 2025 at 04:30:05PM -0500, Taylor Blau wrote:

quoted hunk ↗ jump to hunk

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 433070a3bd..892176d23d 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c

@@ -261,7 +261,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	git_hash_ctx ctx;
 	unsigned char obuf[16384];
 	unsigned header_len;
-	struct hashfile_checkpoint checkpoint = {0};
+	struct hashfile_checkpoint checkpoint;
 	struct pack_idx_entry *idx = NULL;

 	seekback = lseek(fd, 0, SEEK_CUR);

@@ -272,12 +272,15 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 					  OBJ_BLOB, size);
 	the_hash_algo->init_fn(&ctx);
 	the_hash_algo->update_fn(&ctx, obuf, header_len);
-	the_hash_algo->unsafe_init_fn(&checkpoint.ctx);

 	/* Note: idx is non-NULL when we are writing */
-	if ((flags & HASH_WRITE_OBJECT) != 0)
+	if ((flags & HASH_WRITE_OBJECT) != 0) {
 		CALLOC_ARRAY(idx, 1);

+		prepare_to_stream(state, flags);
+		hashfile_checkpoint_init(state->f, &checkpoint);
+	}
+
 	already_hashed_to = 0;

 	while (1) {

Yeah, that's ugly. We are potentially throwing away the hashfile that
the checkpoint was created for. That makes my instinct to push the
checkpoint down into the loop where we we might restart a new pack, like
this (and like you suggested below):

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 433070a3bd..efa59074fb 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c

@@ -261,7 +261,6 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	git_hash_ctx ctx;
 	unsigned char obuf[16384];
 	unsigned header_len;
-	struct hashfile_checkpoint checkpoint = {0};
 	struct pack_idx_entry *idx = NULL;
 
 	seekback = lseek(fd, 0, SEEK_CUR);

@@ -281,8 +280,10 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	already_hashed_to = 0;
 
 	while (1) {
+		struct hashfile_checkpoint checkpoint = {0};
 		prepare_to_stream(state, flags);
 		if (idx) {
+			hashfile_checkpoint_init(state->f, &checkpoint);
 			hashfile_checkpoint(state->f, &checkpoint);
 			idx->offset = state->offset;
 			crc32_begin(state->f);

but that doesn't work, because the checkpoint is also used later for the
already_written() check:

        if (already_written(state, result_oid)) {
                hashfile_truncate(state->f, &checkpoint);
                state->offset = checkpoint.offset;
                free(idx);
	} else

That made me wonder if there is a bug lurking there. What if we found
the pack was too big, truncated to our checkpoint, and then opened a new
pack? Then the original checkpoint would now be bogus! It would mention
an offset in the original packfile which doesn't make any sense with
what we have open. But I think this is OK, because we can only leave the
loop when stream_blob_to_pack() returns, and we always establish a new
checkpoint before then.

So I do think that moving the initialization of the checkpoint into the
loop, but _not_ moving the variable would work the same way it does now
(i.e., what you suggested below).

But I admit that the way this loop works kind of makes my head spin. It
can really only ever run twice, but it is hard to see: we break out if
stream_blob_to_pack() returns success. And it will only return error if
we would bust the packsize limit (all other errors cause us to die()).
And only if we would bust the limit _and_ we are not the only object in
the pack. And since we start a new pack if we loop, that will never be
true on the second iteration; we'll always either die() or return
success.

I do think it would be much easier to read with a single explicit retry:

  if (checkpoint_and_try_to_stream() < 0) {
	/* we busted the limit; make a new pack and try again */
	hashfile_truncate();
	etc...
        if (checkpoint_and_try_to_stream() < 0)
		BUG("yikes, we should not fail a second time!");
  }

where checkpoint_and_try_to_stream() is the first half of the loop, up
to the stream_blob_to_pack() call.

Anyway, that is all outside of your patch, and relevant only because
_if_ we untangled it a bit more, it might make the checkpoint lifetime a
bit more obvious and less scary to refactor.

But it does imply to me that the data dependency introduced by my
suggestion is not always so straight-forward as I thought it would be,
and we should probably punt on it for your series.

which would do the trick, but it feels awfully hacky to have the "if
(checkpoint.f != state->f)" check in there, since that feels too
intimately tied to the implementation of the hashfile_checkpoint API for
my comfort.

I think you could unconditionally checkpoint at that part; we're about
to do a write, so we want to store the state before the write in case we
need to roll back.

Anyway, that's all to say that I think that while this is probably
doable in theory, in practice it's kind of a mess, at least currently.
I would rather see if there are other ways to clean up the
deflate_blob_to_pack() function first in a way that made this change
less awkward.

Yeah, I actually wrote what I wrote above before reading this far down
in your email, but we arrived at the exact same conclusion. ;) Hopefully
what I wrote might give some pointers if somebody wants to refactor
later.

I think the most reasonable course here would be to pursue a minimal
change like the one presented here and then think about further clean up
as a separate step.

Yep. Thanks for looking into it.

-Peff

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help