Re: Determining whether you have a commit locally, in a partial clone?

From: Jeff King <hidden>
Date: 2023-06-27 08:10:10

On Wed, Jun 21, 2023 at 12:10:33PM +0200, Tao Klerks wrote:

quoted

This is not very efficient, but:

  git cat-file --batch-check='%(objectname)' --batch-all-objects --unordered |
  grep $some_sha1

will tell you whether we have the object locally.

Thanks so much for your help!

in Windows (msys or git bash) this is still very slow in my repo with
6,500,000 local objects - around 60s - but in linux on the same repo
it's quite a lot faster, at 5s. A large proportion of my users are on
Windows though, so I don't think this will be "good enough" for my
purposes, when I often need to check for the existence of dozens or
even hundreds of commits.

Yeah, it's just a lot of object names to print, most of which you don't
care about. :)

The more efficient thing would be to open the actual pack .idx files and
look for the names via binary search. I don't think you can convince git
to do that, though I suspect you could write a trivial libgit2 program
that does.

quoted

I don't work with partial clones often, but it feels like being able to
say:

  git --no-partial-fetch cat-file ...

would be a useful primitive to have.

It feels that way to me, yes!

On the other hand, I find very little demand for it when I search "the
internet" - or I don't know how to search for it.

I think partial clones are still new enough that not many people are
using them heavily. And when they do, not managing the partial state at
a very advanced level; I think tools for pruning locally cached objects
(which you could refetch) is only just being worked on now.

quoted

It does seem like you might be able to bend it to
your will here, though. I think without any patches that:

  git rev-list --objects --exclude-promisor-objects $oid

will tell you whether we have the object or not (since it turns off
fetch_if_missing, and thus will either succeed, printing nothing, or
bail if the object can't be found).

This behaves in a way that I don't understand:

In the repo that I'm working in, this command runs successfully
*without fetching*, but it takes a *very* long time - 300+ seconds -
much longer than even the "inefficient" 'cat-file'-based printing of
all (6.5M) local object ids that you proposed above. I haven't
attempted to understand what's going on in there (besides running with
GIT_TRACE2_PERF, which showed nothing interesting), but the idea that
git would have to work super-hard to find an object by its ID seems
counter to everything I know about it. Would there be value in my
trying to understand & reproduce this in a shareable repo, or is there
already an explanation as to why this command could/should ever do
non-trivial work, even in the largest partial repos?

I think it's actually doing the gigantic traversal (and just limiting it
when it sees objects that are not available). You probably want
"--no-walk" at least, but really you don't even want to walk the trees
of any commits you specify (so you'd want to omit "--objects" if you are
asking about a commit, and otherwise include it, which is slightly
awkward).

quoted

It feels like --missing=error should
function similarly, but it seems to still lazy-fetch (I guess since it's
the default, the point is to just find truly unavailable objects). Using
--missing=print disables the lazy-fetch, but it seems to bail
immediately if you ask it about a missing object (I didn't dig, but my
guess is that --missing is mostly about objects we traverse, not the
initial tips).

Woah, "--missing=print" seems to work!!!

The following gives me the commit hash if I have it locally, and an
error otherwise - consistently across linux and windows, git versions
2.41, 2.39, 2.38, and 2.36 - without fetching, and without crazy
CPU-churning:

git rev-list --missing=print -1 $oid

Thank you thank you thank you!

Hmph, I thought I tried that before and it didn't work, but it seems to
work for me now. I guess I was hoping to have it print the missing
object rather than exiting with an error, but if you do one object at a
time then the error is sufficient signal. :)

You might want "--objects" if you're going to ask about non-commits.
Though it might not be necessary. I suspect Git would bail trying to
look up the object in the first place if we don't have it, and if we do
have it then it just becomes a silent noop.

I feel like I should try to work something into the doc about this,
but I'm not sure how to express this: "--missing=error is the default,
but it doesn't actually error out when you're explicitly asking about
a missing commit, it fetches it instead - but --missing=print actually
*does* error out if you explicitly ask about a missing commit" seems
like a strange thing to be saying.

I think we are relying on the side effect that everything except
--missing=error will turn off auto-fetching. I don't know if that's
something we'd want to document. It seems reasonable to me that we might
later change the implementation so that we kick in the --missing
behavior only after parsing the initial list of traversal tips (I mean,
I don't know why we would do that in particular, but it seems like the
kind of thing we'd want to reserve as an implementation detail subject
to change).

I do think in the long run that a big "--do-not-lazy-fetch" flag would
be the right solution to let the user tell us what they want.

Thanks again for finding me an efficient working strategy here!

I'm glad it worked. I was mostly just thinking out loud. ;)

-Peff

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help