Thread (22 messages) 22 messages, 4 authors, 2023-05-06

Re: [PATCH v2] builtin/pack-objects.c: introduce `pack.extraCruftTips`

From: Taylor Blau <hidden>
Date: 2023-05-03 21:22:17

On Wed, May 03, 2023 at 03:59:06PM -0400, Jeff King wrote:
On Tue, May 02, 2023 at 08:09:47PM -0400, Taylor Blau wrote:
quoted
However, there is no option to be able to keep around certain objects
that have otherwise aged out of the grace period. The only way to retain
those objects is:

  - to point a reference at them, which may be undesirable or
    infeasible,
  - to track them via the reflog, which may be undesirable since the
    reflog's lifetime is limited to that of the reference it's tracking
    (and callers may want to keep those unreachable objects around for
    longer)
  - to extend the grace period, which may keep around other objects that
    the caller *does* want to discard,
  - or, to force the caller to construct the pack of objects they want
    to keep themselves, and then mark the pack as kept by adding a
    ".keep" file.
OK, I understand the use case you're trying to support, and your
approach mostly makes sense. But there are two things I was surprised by
in the implementation:

  1. Does this need to be tied to cruft packs? The same logic would
     apply to "repack -A" which turns objects loose (of course you
     probably don't want to do that in the long term for performance
     reasons, but it's conceptually the same thing; see below).

  2. Why is there a separate walk distinct from the one that rescues
     recent-but-unreachable objects?
Conceptually it seems to me that the simplest and most flexible way to
think of this new feature is: pretend these objects are recent enough to
be kept in the grace period, even though their mtimes do not qualify".

And then everything else would just fall out naturally. Am I missing
something?
I originally shied away from it, thinking that I wouldn't want to do an
expensive walk with all of the recent objects during a non-pruning
repack.

The two code paths are quite different in practice. In the pruning case,
we add only new objects from the kept packs and then start our traversal
there. In the non-pruning case, we just do
`add_objects_in_unpacked_packs()` which is really just a call to
`for_each_packed_object()`.

So it gets tricky when you have a pack.extraCruftTips program and want
to invoke it in a non-pruning case. You could do something like:

  - call enumerate_and_traverse_cruft_objects() *always*, either because
    we were doing a pruning GC, or calling it after
    `enumerate_cruft_objects()` (in the non-pruning case)

  - ensure that enumerate_and_traverse_cruft_objects() is a noop when
    (a) cruft_expiration is set to zero, and (b) there are no
    pack.extraCruftTips programs specified
quoted
+static int enumerate_extra_cruft_tips_1(struct rev_info *revs, const char *args)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct strbuf buf = STRBUF_INIT;
+	FILE *out = NULL;
+	int ret = 0;
+
+	cmd.use_shell = 1;
+	cmd.out = -1;
+
+	strvec_push(&cmd.args, args);
+	strvec_pushv(&cmd.env, (const char **)local_repo_env);
Why does this clear the environment of local-repo variables? That seems
unlike most other hooks we have, and would cause confusion if $GIT_DIR
or various config variables are important (e.g., should "git -c foo.bar
gc" persist foo.bar when running the hook? It usually does).

I know that some hooks that try to change repositories by changing
directories have the opposite confusion ($GIT_DIR is set, but they did
not want it). But it makes more sense to me to remain consistent with
how other hooks behave.
Copy-and-pasted from the core.alternateRefsCommand stuff, there is no
need for this here.
quoted
+	if (start_command(&cmd)) {
+		ret = -1;
+		goto done;
+	}
This may be a matter of taste, but you can "return -1" directly here, as
nothing has been allocated (and a failed start_command() will call
child_process_clear() for you). This would mean "out" is always set in
the "done:" label, so it wouldn't need a NULL-check (nor an
initialization).
Very nice, thanks for the suggestion.
quoted
+
+	out = xfdopen(cmd.out, "r");
+	while (strbuf_getline_lf(&buf, out) != EOF) {
is there any reason to be a stickler for LF versus CRLF here? I.e.,
wouldn't strbuf_getline() be more friendly, with little chance that we
misinterpret the result?
No reason, other than I happened to type this one out first ;-). We
should definitely be more permissive here, I agree that strbuf_getline()
is the friendlier choice.
quoted
+		struct object_id oid;
+		struct object *obj;
+		const char *rest;
+
+		if (parse_oid_hex(buf.buf, &oid, &rest) || *rest) {
+			ret = error(_("invalid extra cruft tip: '%s'"), buf.buf);
+			goto done;
+		}
+
+		obj = parse_object(the_repository, &oid);
+		if (!obj)
+			continue;
This parse_object() can be pretty expensive, especially if you are
rescuing a lot of objects (or if your program directly references large
blobs). Can we use oid_object_info() here in combination with
lookup_object_by_type()?
Yep, definitely. Nice catch ;-).

Thanks,
Taylor
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help