Re: Determining whether you have a commit locally, in a partial clone?
From: Jeff King <hidden>
Date: 2023-06-27 08:10:10
On Wed, Jun 21, 2023 at 12:10:33PM +0200, Tao Klerks wrote:
quoted
This is not very efficient, but: git cat-file --batch-check='%(objectname)' --batch-all-objects --unordered | grep $some_sha1 will tell you whether we have the object locally. Thanks so much for your help! In Windows (msys or git bash) this is still very slow in my repo with 6,500,000 local objects - around 60s - but in Linux on the same repo it's quite a lot faster, at 5s. A large proportion of my users are on Windows though, so I don't think this will be "good enough" for my purposes, when I often need to check for the existence of dozens or even hundreds of commits.
Yeah, it's just a lot of object names to print, most of which you don't care about. :) The more efficient thing would be to open the actual pack .idx files and look for the names via binary search. I don't think you can convince git to do that, though I suspect you could write a trivial libgit2 program that does.
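When dozens or hundreds of names need checking, the single pass over all local object names can at least be shared across them. A minimal sketch, assuming full-length hex object names, one per line on stdin; the function name filter_local_oids is made up for illustration:

```shell
# Print the subset of the object names given on stdin that exist in the
# local object store. Hypothetical helper; assumes full-length hex names.
filter_local_oids() {
    wanted=$(mktemp) || return 1
    cat >"$wanted"
    # One pass over every local object name, keeping only exact matches.
    git cat-file --batch-check='%(objectname)' --batch-all-objects --unordered |
        grep -F -x -f "$wanted"
    status=$?
    rm -f "$wanted"
    return "$status"
}
```

This still prints every local object name once, so it has the same fixed cost as the single-object version; it only avoids paying that cost once per name.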
quoted
I don't work with partial clones often, but it feels like being able to say: git --no-partial-fetch cat-file ... would be a useful primitive to have. It feels that way to me, yes! On the other hand, I find very little demand for it when I search "the internet" - or I don't know how to search for it.
I think partial clones are still new enough that not many people are using them heavily. And those who do are not managing the partial state at a very advanced level; I think tools for pruning locally cached objects (which you could refetch) are only just being worked on now.
quoted
It does seem like you might be able to bend it to your will here, though. I think without any patches that: git rev-list --objects --exclude-promisor-objects $oid will tell you whether we have the object or not (since it turns off fetch_if_missing, and thus will either succeed, printing nothing, or bail if the object can't be found). This behaves in a way that I don't understand: In the repo that I'm working in, this command runs successfully *without fetching*, but it takes a *very* long time - 300+ seconds - much longer than even the "inefficient" 'cat-file'-based printing of all (6.5M) local object ids that you proposed above. I haven't attempted to understand what's going on in there (besides running with GIT_TRACE2_PERF, which showed nothing interesting), but the idea that git would have to work super-hard to find an object by its ID seems counter to everything I know about it. Would there be value in my trying to understand & reproduce this in a shareable repo, or is there already an explanation as to why this command could/should ever do non-trivial work, even in the largest partial repos?
I think it's actually doing the gigantic traversal (and just limiting it when it sees objects that are not available). You probably want "--no-walk" at least, but really you don't even want to walk the trees of any commits you specify (so you'd want to omit "--objects" if you are asking about a commit, and otherwise include it, which is slightly awkward).
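Spelled out, that suggestion might look something like the sketch below. It has not been checked against a large partial clone; the function name and the "commit"/"other" argument are hypothetical, and the caller has to know up front which kind of object it is asking about (the awkward part):

```shell
# $1: "commit" or "other" (hypothetical convention; the caller must know which)
# $2: object name
# Returns 0 if rev-list accepts the object without walking history,
# non-zero if it bails.
have_object_no_walk() {
    case "$1" in
        commit)
            # Commits: no --objects, so the commit's trees are not walked.
            git rev-list --no-walk --exclude-promisor-objects "$2" ;;
        *)
            # Non-commits: --objects is needed to ask about trees and blobs.
            git rev-list --no-walk --objects --exclude-promisor-objects "$2" ;;
    esac >/dev/null 2>&1
}
```

Only the exit status is used here, since what gets printed on success depends on the kind of object.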
quoted
It feels like --missing=error should function similarly, but it seems to still lazy-fetch (I guess since it's the default, the point is to just find truly unavailable objects). Using --missing=print disables the lazy-fetch, but it seems to bail immediately if you ask it about a missing object (I didn't dig, but my guess is that --missing is mostly about objects we traverse, not the initial tips). Woah, "--missing=print" seems to work!!! The following gives me the commit hash if I have it locally, and an error otherwise - consistently across Linux and Windows, git versions 2.41, 2.39, 2.38, and 2.36 - without fetching, and without crazy CPU-churning: git rev-list --missing=print -1 $oid Thank you thank you thank you!
Hmph, I thought I tried that before and it didn't work, but it seems to work for me now. I guess I was hoping to have it print the missing object rather than exiting with an error, but if you do one object at a time then the error is sufficient signal. :) You might want "--objects" if you're going to ask about non-commits. Though it might not be necessary. I suspect Git would bail trying to look up the object in the first place if we don't have it, and if we do have it then it just becomes a silent noop.
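For checking many commits one object at a time, the --missing=print trick could be wrapped roughly as follows. The function names are hypothetical. The check on the output is a defensive assumption, not something from this thread: it covers the case where some Git version reports a missing tip as "?<oid>" (the way --missing=print reports objects found missing during traversal) instead of exiting with an error.

```shell
# Return 0 if the commit named by $1 is available locally, non-zero otherwise.
# Relies on --missing=print turning off lazy fetching, as described above.
have_commit_locally() {
    out=$(git rev-list --missing=print -1 "$1" 2>/dev/null) || return 1
    case "$out" in
        # Empty output (e.g. not a commit) or a "?<oid>" line: treat as absent.
        ''|\?*) return 1 ;;
    esac
    return 0
}

# Read object names from stdin, one per line, and report on each.
report_commits() {
    while read -r oid; do
        if have_commit_locally "$oid"; then
            echo "have $oid"
        else
            echo "missing $oid"
        fi
    done
}
```

Since this spawns one git process per name, it trades the full object scan for process startup cost, which may matter on Windows.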
I feel like I should try to work something into the doc about this, but I'm not sure how to express this: "--missing=error is the default, but it doesn't actually error out when you're explicitly asking about a missing commit, it fetches it instead - but --missing=print actually *does* error out if you explicitly ask about a missing commit" seems like a strange thing to be saying.
I think we are relying on the side effect that everything except --missing=error will turn off auto-fetching. I don't know if that's something we'd want to document. It seems reasonable to me that we might later change the implementation so that we kick in the --missing behavior only after parsing the initial list of traversal tips (I mean, I don't know why we would do that in particular, but it seems like the kind of thing we'd want to reserve as an implementation detail subject to change). I do think in the long run that a big "--do-not-lazy-fetch" flag would be the right solution to let the user tell us what they want.
Thanks again for finding me an efficient working strategy here!
I'm glad it worked. I was mostly just thinking out loud. ;) -Peff