Thread: 9 messages, 4 authors, 2024-10-14

Re: Missing Promisor Objects in Partial Repo Design Doc

From: Jonathan Tan <hidden>
Date: 2024-10-09 18:53:15

Junio C Hamano [off-list ref] writes:
> > (C2b is a bit of a special case. Despite not being in a promisor pack,
> > it is still considered to be a promisor object since C3 directly
> > references it.)
>
> Yes, and I suspect the root cause of this confusion is because
> "promisor object", as defined today, is a flawed concept.  If C2b
> were pointed by a local ref, just like the case the ref points at
> C2a, they should be treated the same way, as both of them are
> locally created.  To put it another way, presumably the local have
> already been pushed out to elsewhere and the promisor remote got
> hold of them, and that is why C3 can build on top of them.  And the
> fact C2b is directly reachable from C3 and C2a is not should not
> have any relevance if C2a or C2b are not _included_ in promisor
> packs (hence both of them need to be included in the local pack).
>
> Two concepts that would have been useful are (1) objects that are in
> promisor packs and (2) objects that are reachable from an object
> that is in a promisor pack.  I do not see how the current definition
> of "promisor objects" (i.e. in a promisor pack, or one hop from an
> object in a promisor pack) is useful in any context.

The one-hop part in the current definition is meant to (a) explain what
objects the client knows the remote has (in theory the client has no
knowledge of objects beyond the first hop, but we now know this theory
to not be true) and (b) explain what objects a non-promisor object can
reference (in particular, a non-promisor tree can reference promisor
blobs, even when our knowledge of that promisor blob only comes from a
tree in a promisor pack).

If we treat a promisor commit being a child of a non-promisor
commit as a "bad state" that needs to be fixed [1], then the current
one-hop definition seems to be equivalent to (2).

As for (1), we do use that concept in Git, although it's limited to the
repack during GC (or maybe there are others that I don't recall), so the
concept doesn't have a widely-used name like "promisor object".
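
To make the three notions concrete, here is a minimal Python sketch. It
is a toy model, not Git code: the four-commit history and the names
PARENTS, one_hop, reachable, mixed and fixed are all assumptions made
for illustration, loosely following the C2a/C2b/C3 example.

```python
# Hypothetical history: each commit maps to the commits it directly
# references (its parents). Not taken from the design doc verbatim.
PARENTS = {
    "C3": ["C2b"],   # fetched from the promisor remote
    "C2b": ["C2a"],  # created locally
    "C2a": ["C1"],   # created locally
    "C1": [],        # fetched from the promisor remote
}


def one_hop(in_promisor_pack):
    """Current definition: in a promisor pack, or one hop from such an object."""
    result = set(in_promisor_pack)
    for oid in in_promisor_pack:
        result.update(PARENTS[oid])
    return result


def reachable(in_promisor_pack):
    """Concept (2): everything reachable from an object in a promisor pack."""
    seen, stack = set(), list(in_promisor_pack)
    while stack:
        oid = stack.pop()
        if oid not in seen:
            seen.add(oid)
            stack.extend(PARENTS[oid])
    return seen


# Concept (1) is simply the set of objects stored in promisor packs.
mixed = {"C3", "C1"}                # "bad state": C2a/C2b are not in promisor packs
fixed = {"C3", "C2b", "C2a", "C1"}  # assumed state after the bad state is fixed

print(sorted(one_hop(mixed)))    # C2b counts as a promisor object, C2a does not
print(sorted(reachable(mixed)))  # both C2a and C2b are included
print(one_hop(fixed) == reachable(fixed))
```

In this toy history the one-hop set and the reachable set differ only
while C2a/C2b sit outside promisor packs. Once they are moved in, the
two sets coincide.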

[1] https://lore.kernel.org/git/20241001191811.1934900-1-calvinwan@google.com/

> > Garbage Collection repack
> > -------------------------
> > Not yet implemented.
> >
> > Same concept as “fetch repack”, but happens during garbage collection
> > instead. The traversal is more expensive since we no longer have access
> > to what was recently fetched so we have to traverse through all promisor
> > packs to collect tips of “bad” history.
>
> In other words, with the status quo, "git gc" that attempts to
> repack "objects in promisor packs" and "other objects that did not
> get repacked in the step that repack objects in promisor packs"
> separately, it implements the latter in a buggy way and discards
> some objects.  And fixing that bug by doing the right thing is
> expensive.
>
> Stepping back a bit, why is the loss of C2a/C2b/C2 a problem after
> "git gc"?  Wouldn't these "missing" objects be lazily fetchable, now
> C3 is known to the remote and the remote promises everything
> reachable from what they offer are (re)fetchable from them?  IOW, is
> this a correctness issue, or only performance issue (of having to
> re-fetch what we once locally had)?

I believe the re-fetch didn't happen because it was run from a command
with fetch_if_missing=0. (But even if we decide that we shouldn't use
fetch_if_missing, and then change all commands to not use it, there
still remains the performance issue, so we should still fix it.)
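
The effect of that flag can be pictured with a small Python model. This
is only a sketch under stated assumptions: read_object,
MissingObjectError and the two dictionaries are hypothetical stand-ins,
not Git's actual object-reading code. Only the name fetch_if_missing
comes from the discussion above.

```python
class MissingObjectError(Exception):
    """Raised when an object is absent and may not be lazily fetched."""


def read_object(oid, local, remote, fetch_if_missing=True):
    """Return an object, lazily fetching it from the remote if allowed."""
    if oid in local:
        return local[oid]
    if fetch_if_missing and oid in remote:
        local[oid] = remote[oid]  # lazy fetch, then keep a local copy
        return local[oid]
    raise MissingObjectError(oid)


# Hypothetical state after "git gc" has dropped C2a and C2b locally.
remote = {"C2a": "commit C2a", "C2b": "commit C2b", "C3": "commit C3"}
local = {"C3": "commit C3"}

try:
    read_object("C2b", local, remote, fetch_if_missing=False)
except MissingObjectError as err:
    print("not fetched, reported missing:", err)

print(read_object("C2b", local, remote))  # fetched on demand and cached
```

With the flag off, the lookup fails even though the remote could serve
the object. With it on, the lookup succeeds but pays for a re-fetch of
something the repository once had locally.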

> > Cons: Packing local objects into promisor packs means that it is no
> > longer possible to detect if an object is missing due to repository
> > corruption or because we need to fetch it from a promisor remote.
>
> Is this true?  Can we tell, when trying to access C2a/C2b/C2 after
> the current version of "git gc" removes them from the local object
> store, that they are missing due to repository corruption?  After
> all, C3 can reach them so wouldn't it be possible for us to fetch
> them from the promisor remote?
>
> After a lazy clone that omits a lot of objects acquires many objects
> over time by fetching missing objects on demand, wouldn't we want to
> have an option to "slim" the local repository by discarding some of
> these objects (the ones that are least frequently used), relying on
> the promise by the promisor remote that even if we did so, they can
> be fetched again?  Can we treat loss of C2a/C2b/C2 as if such a
> feature prematurely kicked in?  Or are we failing to refetch them
> for some reason?

This is under the "repack all" option, which states that we repack all
objects (wherever they came from) into promisor packs. If we locally
created commit A and then its child commit B, and the repo got corrupted
so that we lost A, repacking all objects would mean that we could never
detect that the loss of A is problematic.
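
The A/B example can be sketched in a few lines of Python. This is a toy
classification under the one-hop definition discussed earlier, not
Git's implementation. classify_missing, references and the pack sets
are hypothetical names.

```python
def classify_missing(oid, references, in_promisor_pack):
    """Classify a missing object under the one-hop definition.

    references maps each object still present to the objects it
    directly references. A missing object referenced from a promisor
    pack looks like a promised (fetchable) object. Otherwise its
    absence indicates corruption.
    """
    for referrer, targets in references.items():
        if referrer in in_promisor_pack and oid in targets:
            return "promised"
    return "corruption"


# B is the locally created child of the locally created commit A.
# A has been lost, so only B remains in the object store.
references = {"B": ["A"]}

status_quo = set()   # B lives in an ordinary (non-promisor) pack
repack_all = {"B"}   # "repack all" moved B into a promisor pack

print(classify_missing("A", references, status_quo))  # corruption
print(classify_missing("A", references, repack_all))  # promised
```

Once B sits in a promisor pack, the missing A is indistinguishable from
an object the remote promised, even though A was never pushed anywhere.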