Thread: 6 messages, 2 authors, 2023-06-27

Re: Determining whether you have a commit locally, in a partial clone?

From: Tao Klerks <hidden>
Date: 2023-06-21 10:11:58

On Wed, Jun 21, 2023 at 8:45 AM Jeff King [off-list ref] wrote:
> On Tue, Jun 20, 2023 at 09:12:24PM +0200, Tao Klerks wrote:
> > I'm back to begging for any hints here: Any idea how I can determine
> > whether a given commit object exists locally, *without causing it to
> > be fetched by the act of checking for it?*
>
> This is not very efficient, but:
>
>   git cat-file --batch-check='%(objectname)' --batch-all-objects --unordered |
>   grep $some_sha1
>
> will tell you whether we have the object locally.

Thanks so much for your help!

On Windows (msys or Git Bash) this is still very slow in my repo with
6,500,000 local objects - around 60s - but on Linux, in the same repo,
it's quite a lot faster, at 5s. A large proportion of my users are on
Windows though, so I don't think this will be "good enough" for my
purposes, when I often need to check for the existence of dozens or
even hundreds of commits.
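
For the "dozens or even hundreds of commits" case, the cost of listing every local object could at least be paid once rather than once per commit. A minimal sketch, assuming the wanted IDs are full, lowercase object IDs stored one per line in a file; the function name `local_objects_among` and the file name are hypothetical, not anything Git provides:

```shell
# Hypothetical helper: print which of the wanted object IDs exist locally.
# $1 is a file holding one full, lowercase object ID per line.
# The local object list is enumerated once, then filtered against the
# whole set of wanted IDs in a single grep.
local_objects_among() {
    git cat-file --batch-check='%(objectname)' --batch-all-objects --unordered |
        grep -F -x -f "$1"
}

# Example use (wanted.txt is an assumed file name):
#   local_objects_among wanted.txt
```

`grep -F -x` matches whole lines as fixed strings, so an ID only counts when the complete object name is present. This still walks all local objects, so it amortizes the 60s rather than removing it.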

> I don't work with partial clones often, but it feels like being able to
> say:
>
>   git --no-partial-fetch cat-file ...
>
> would be a useful primitive to have.

It feels that way to me, yes!

On the other hand, I find very little demand for it when I search "the
internet" - or I don't know how to search for it.

> The implementation might start
> something like this:
>
> diff --git a/object-file.c b/object-file.c
> index 7c1af5c8db..494cdd7706 100644
> --- a/object-file.c
> +++ b/object-file.c
> @@ -1555,6 +1555,14 @@ void disable_obj_read_lock(void)
>
>  int fetch_if_missing = 1;
>
> +static int allow_lazy_fetch(void)
> +{
> +       static int ret = -1;
> +       if (ret < 0)
> +               ret = git_env_bool("GIT_PARTIAL_FETCH", 1);
> +       return ret;
> +}
> +
>  static int do_oid_object_info_extended(struct repository *r,
>                                        const struct object_id *oid,
>                                        struct object_info *oi, unsigned flags)
> @@ -1622,6 +1630,7 @@ static int do_oid_object_info_extended(struct repository *r,
>
>                 /* Check if it is a missing object */
>                 if (fetch_if_missing && repo_has_promisor_remote(r) &&
> +                   allow_lazy_fetch() &&
>                     !already_retried &&
>                     !(flags & OBJECT_INFO_SKIP_FETCH_OBJECT)) {
>                         promisor_remote_get_direct(r, real, 1);
>
> and then have git.c populate the environment variable, similar to how we
> handle --literal-pathspecs, etc.
>
> That fetch_if_missing kind of does the same thing, but it's mostly
> controlled by programs themselves which try to handle missing remote
> objects specially.

Thanks, I will play with this if I get the chance. That said, I don't
control my users' distributions of Git, so on a purely practical basis
I'm looking for something that will work in git 2.39 to whatever
future version would introduce such a capability. (before 2.39, the
"set remote to False" hack works)

> It does seem like you might be able to bend it to
> your will here, though. I think without any patches that:
>
>   git rev-list --objects --exclude-promisor-objects $oid
>
> will tell you whether we have the object or not (since it turns off
> fetch_if_missing, and thus will either succeed, printing nothing, or
> bail if the object can't be found).

This behaves in a way that I don't understand:

In the repo that I'm working in, this command runs successfully
*without fetching*, but it takes a *very* long time - 300+ seconds -
much longer than even the "inefficient" 'cat-file'-based printing of
all (6.5M) local object ids that you proposed above. I haven't
attempted to understand what's going on in there (besides running with
GIT_TRACE2_PERF, which showed nothing interesting), but the idea that
git would have to work super-hard to find an object by its ID seems
counter to everything I know about it. Would there be value in my
trying to understand & reproduce this in a shareable repo, or is there
already an explanation as to why this command could/should ever do
non-trivial work, even in the largest partial repos?

> It feels like --missing=error should
> function similarly, but it seems to still lazy-fetch (I guess since it's
> the default, the point is to just find truly unavailable objects). Using
> --missing=print disables the lazy-fetch, but it seems to bail
> immediately if you ask it about a missing object (I didn't dig, but my
> guess is that --missing is mostly about objects we traverse, not the
> initial tips).

Woah, "--missing=print" seems to work!!!

The following gives me the commit hash if I have it locally, and an
error otherwise - consistently across Linux and Windows, Git versions
2.41, 2.39, 2.38, and 2.36 - without fetching, and without crazy
CPU-churning:

  git rev-list --missing=print -1 $oid
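
Wrapped up for scripting, the check might look roughly like this. It is only a sketch: the function name `have_commit_locally` is made up, and it assumes the argument is a full, lowercase commit ID rather than a ref or abbreviation. Comparing the printed hash instead of trusting the exit status alone is a defensive assumption on my part: Git versions other than the ones tested above might report a missing tip differently (for example as a `?`-prefixed line) rather than failing, and that is worth verifying against the version in use.

```shell
# Hypothetical helper: succeed only if the given commit is already in the
# local object store, without triggering a lazy fetch.
# $1 must be a full, lowercase commit ID (assumption; refs and
# abbreviations would not compare equal to the printed hash).
have_commit_locally() {
    # Reject an empty argument up front so "" never compares equal to "".
    [ -n "$1" ] || return 1
    # On success rev-list prints exactly the commit ID; on failure the
    # output is empty or differs, so the comparison fails.
    [ "$(git rev-list --missing=print -1 "$1" 2>/dev/null)" = "$1" ]
}

# Example use, one ID per line on stdin (wanted.txt is an assumed name):
#   while read -r oid; do
#       have_commit_locally "$oid" && echo "$oid"
#   done < wanted.txt
```

Each call starts a new git process, so for hundreds of commits the per-process startup cost on Windows is still paid every time, but there is no full object scan involved.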

Thank you thank you thank you!

I feel like I should try to work something into the doc about this,
but I'm not sure how to express this: "--missing=error is the default,
but it doesn't actually error out when you're explicitly asking about
a missing commit, it fetches it instead - but --missing=print actually
*does* error out if you explicitly ask about a missing commit" seems
like a strange thing to be saying.

Thanks again for finding me an efficient working strategy here!