Thread (65 messages) 65 messages, 7 authors, 2018-10-16

Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph

From: Ævar Arnfjörð Bjarmason <hidden>
Date: 2018-10-05 14:04:20

On Fri, Oct 05 2018, Derrick Stolee wrote:
On 10/5/2018 9:05 AM, Ævar Arnfjörð Bjarmason wrote:
quoted
On Fri, Oct 05 2018, Derrick Stolee wrote:
quoted
On 10/4/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote:
quoted
I don't have time to polish this up for submission now, but here's a WIP
patch that implements this, highlights:

   * There's a gc.clone.autoDetach=false default setting which overrides
     gc.autoDetach if 'git gc --auto' is run via git-clone (we just pass a
     --cloning option to indicate this).
I'll repeat that it could make sense to do the same thing on clone
_and_ fetch. Perhaps a "--post-fetch" flag would be good here to
communicate that we just downloaded a pack from a remote.
I don't think that makes sense, but let's talk about why, because maybe
I've missed something, you're certainly more familiar with the
commit-graph than I am.

The reason to do it on clone as a special-case or when the file is
missing, is because we know the file is desired (via the GC config), and
presumably is expected to help performance, and we have 0% of it. So by
going from 0% to 100% on clone we'll get fast --contains and other
goodies the graph helps with.

But when we're doing a fetch, or really anything else that runs "git gc
--auto" we can safely assume that we have a recent enough graph, because
it will have been run whenever auto-gc kicked in.

I.e.:

     # Slow, if we assume background forked commit-graph generation
     # (which I'm avoiding)
     git clone x && cd x && git tag --contains
     # Fast enough, since we have an existing commit-graph
     cd x && git fetch && git tag --contains

I *do* think it might make sense to in general split off parts of "gc
--auto" that we'd like to be more aggressive about, simply because the
ratio of how long it takes to do, and how much it helps with performance
makes more sense than a full repack, which is what the current heuristic
is based on.

And maybe when we run in that mode we should run in the foreground, but
I don't see why git-fetch should be a special case there, and in this
regard, the gc.clone.autoDetach=false setting I've made doesn't make
much sence. I.e. maybe we should also skip forking to the background in
such a mode when we trigger such a "mini gc" via git-commit or whatever.
My misunderstanding was that your proposed change to gc computes the
commit-graph in either of these two cases:

(1) The auto-GC threshold is met.

(2) There is no commit-graph file.

And what I hope to have instead of (2) is (3):

(3) The commit-graph file is "sufficiently behind" the tip refs.

This condition is intentionally vague at the moment. It could be that
we hint that (3) holds by saying "--post-fetch" (i.e. "We just
downloaded a pack, and it probably contains a lot of new commits") or
we could create some more complicated condition based on counting
reachable commits with infinite generation number (the number of
commits not in the commit-graph file).

I like that you are moving forward to make the commit-graph be written
more frequently, but I'm trying to push us in a direction of writing
it even more often than your proposed strategy. We should avoid
creating too many orthogonal conditions that trigger the commit-graph
write, which is why I'm pushing on your design here.

Anyone else have thoughts on this direction?
Ah. I see. I think #3 makes perfect sense, but probably makes sense to
do as a follow-up, or maybe you'd like to stick a patch on top of the
series I have when I send it. I don't know how to write the "I'm not
quite happy about the commit graph" code :)

What I will do is refactor gc.c a bit and leave it in a state where it's
going to be really easy to change the existing "we have no commit graph,
and thus should do the optimization step" to have some more complex
condition instead of "we have no commit graph", i.e. your "we just
grabbed a lot of data".

Also, I'll drop the gc.clone.autoDetach=false setting and name it
something more general. maybe gc.AutoDetachOnBigOptimization=false?
Anyway something more generic so that "clone" will always pass in some
option saying "expect a large % commit graph update" (100% in its case),
and then in "fetch" we could have some detection of how big what we just
got from the server is, and do the same.

This seems to be to be the most general thing that would make sense, and
could also be extended e.g. to "git commit" and other users of gc
--auto. If I started with a README file in an empty repo, and then made
a commit where I added 1 million files all in one commit, in which case
we'd (depending on that setting) also block in the foreground and
generate the commit-graph.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help