Re: Missing Promisor Objects in Partial Repo Design Doc

From: Junio C Hamano <hidden>
Date: 2024-10-02 02:54:53

Missing Promisor Objects in Partial Repo Design Doc
===================================================

Basic Reproduction Steps
------------------------

 - Partial clone repository
 - Create local commit and push
 - Fetch new changes
 - Garbage collection

State After Reproduction
------------------------

commit  tree  blob
  C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
  |
  C2b ---- T2b -- B2b (created locally, in non-promisor pack)
  |
  C2a ---- T2a -- B2a (created locally, in non-promisor pack)
  |
  C1 ---- T1 -- B1 (fetched from remote, in promisor pack)

Explanation of the Problem
--------------------------

In a partial clone repository, non-promisor commits are locally
committed as children of promisor commits and then pushed up to the
server. Fetches of new history can result in promisor commits that have
non-promisor commits as ancestors. During garbage collection, objects
are repacked in 2 steps. In the first step, if there is more than one
promisor packfile, all objects in promisor packfiles are repacked into a
single promisor packfile. In the second step, a revision walk is made
from all refs (and some other things like HEAD and reflog entries) that
stops whenever it encounters a promisor object. In the example above, if
a ref pointed directly to C2a, it would be returned by the walk (as an
object to be packed). But if we only had a ref pointing to C3, the
revision walk immediately sees that it is a promisor object, does not
return it, and does not iterate through its parents.

True.  Will it become even worse, if a protocol extension Christian
proposes starts suggesting a repository that is not lazy to add a
promisor remote?  In such a set-up, perhaps all history leading to
C2b down to the root are local, but C3 may have come from a promisor
remote (hence in a promisor pack).

(C2b is a bit of a special case. Despite not being in a promisor pack,
it is still considered to be a promisor object since C3 directly
references it.)

Yes, and I suspect the root cause of this confusion is because
"promisor object", as defined today, is a flawed concept.  If C2b
were pointed by a local ref, just like the case the ref points at
C2a, they should be treated the same way, as both of them are
locally created.  To put it another way, presumably the local have
already been pushed out to elsewhere and the promisor remote got
hold of them, and that is why C3 can build on top of them.  And the
fact C2b is directly reachable from C3 and C2a is not should not
have any relevance if C2a or C2b are not _included_ in promisor
packs (hence both of them need to be included in the local pack).

Two concepts that would have been useful are (1) objects that are in
promisor packs and (2) objects that are reachable from an object
that is in a promisor pack.  I do not see how the current definition
of "promisor objects" (i.e. in a promisor pack, or one hop from an
object in a promisor pack) is useful in any context.

If we think this is a bad state, we should propagate the “promisor-ness”
of C3 to its ancestors. Git commands should either prevent this state
from occurring or tolerate it and fix it when we can. If we did run into
this state unexpectedly, then it would be considered a BUG.

Yup, that is the basis of the solutions we saw proposed so far.

If we think it is a valid state, we should NOT propagate the
“promisor-ness” of C3 to its ancestors. Git commands should respect that
this is a possible state and be able to work around it. Therefore, this
bug would then be strictly caused by garbage collection

Yes, that is possibly an alternative.

Bad State Solutions
===================

Fetch negotiation
-----------------
Implemented at
https://lore.kernel.org/git/20240919234741.1317946-1-calvinwan@google.com/ (local)

During fetch negotiation, if a commit is not in a promisor pack and
therefore local, do not declare it as "have" so they can be fetched into
a promisor pack.

Cost:
- Creation of set of promisor pack objects (by iterating through every
  .idx of promisor packs)

What is "promisor PACK objects"?  Is it different from the "promisor
objects" (i.e. what I called the useless definition above)?

- Refetch number of local commits

Pros: Implementation is simple, client doesn’t have to repack, prevents
state from ever occurring in the repository.

Cons: Network cost of refetching could be high if many local commits
need to be refetched.

What if we get into the same state by creating local C4, which gets
to outside and on top of which C5 is built, which is now sitting at
the tip of the remote history and we fetch from them?  In order to
include C4 in the "promisor pack", we refrain from saying C4 is a
"have" for us and refetch.  Would C2 be fetched again?

I do not think C2 would be, because we made it an object in a
promisor pack when we "fixed" the history for C3.

So the cost will not grow proportionally to the depth of the
history, which makes it OK from my point of view.

Garbage Collection repack
-------------------------
Not yet implemented.

Same concept at “fetch repack”, but happens during garbage collection
instead. The traversal is more expensive since we no longer have access
to what was recently fetched so we have to traverse through all promisor
packs to collect tips of “bad” history.

In other words, with the status quo, "git gc" that attempts to
repack "objects in promisor packs" and "other objects that did not
get repacked in the step that repack objects in promisor packs"
separately, it implements the latter in a buggy way and discards
some objects.  And fixing that bug by doing the right thing is
expensive.

Stepping back a bit, why is the loss of C2a/C2b/C2 a problem after
"git gc"?  Wouldn't these "missing" objects be lazily fetchable, now
C3 is known to the remote and the remote promises everything
reachable from what they offer are (re)fetchable from them?  IOW, is
this a correctness issue, or only performance issue (of having to
re-fetch what we once locally had)?

Cons: Packing local objects into promisor packs means that it is no
longer possible to detect if an object is missing due to repository
corruption or because we need to fetch it from a promisor remote.

Is this true?  Can we tell, when trying to access C2a/C2b/C2 after
the current version of "git gc" removes them from the local object
store, that they are missing due to repository corruption?  After
all, C3 can reach them so wouldn't it be possible for us to fetch
them from the promisor remote?

After a lazy clone that omits a lot of objects acquires many objects
over time by fetching missing objects on demand, wouldn't we want to
have an option to "slim" the local repository by discarding some of
these objects (the ones that are least frequently used), relying on
the promise by the promisor remote that even if we did so, they can
be fetched again?  Can we treat loss of C2a/C2b/C2 as if such a
feature prematurely kicked in?  Or are we failing to refetch them
for some reason?

Packing local objects into promisor packs means that garbage collection
will no longer remove unreachable local objects.

Valid State Solutions
=====================
Garbage Collection check
------------------------
Not yet implemented.

Currently during the garbage collection rev walk, whenever a promisor
commit is reached, it is marked UNINTERESTING, and then subsequently all
ancestors of the promisor commit are traversed and also marked
UNINTERESTING. Therefore, add a check for whether a commit is local or
not during promisor commit ancestor traversal and do not mark local
commits as UNINTERESTING.

commit  tree  blob
  C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
  |
  C2 ---- T2 -- B2 (created locally, in non-promisor pack, gc does not delete)
  |
  C1 ---- T1 -- B1 (fetched from remote, in promisor pack)

Cost:
- Adds an additional check to every ancestor of a promisor commit.

This is practically the only solution if the state is valid. Fsck would
also have to start checking for validity of ancestors of promisor
commits instead of ignoring them as it currently does.

In the longer term, this looks like the most straight-forward and
easy to explain solution to me.

Optimizations
=============

The “creation of set of promisor pack objects” can be replaced with
“creation of set of non-promisor objects” since the latter is almost
always cheaper and we can check for non-existence rather than existence.
This does not work for “fetch negotiation” since if we have a commit
that's in both a promisor pack and a non-promisor pack, the algorithm's
correctness relies on the fact that we report it as a promisor object
(because we really need the server to re-send it).

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help