Re: [PATCH v4 03/10] commit-graph: compute generation numbers
From: Derrick Stolee <hidden>
Date: 2018-05-01 12:10:26
On 4/29/2018 5:08 AM, Jakub Narebski wrote:
> Derrick Stolee [off-list ref] writes:
>
>> While preparing commits to be written into a commit-graph file, compute the generation numbers using a depth-first strategy.
>
> Sidenote: for generation numbers it does not matter whether we use a depth-first or breadth-first strategy, but depth-first search is more natural because generation numbers need post-order processing (parents before child).
>
>> The only commits that are walked in this depth-first search are those without a precomputed generation number. Thus, computation time will be relative to the number of new commits to the commit-graph file.
>
> A question: what happens if the existing commit graph is from an older version of Git and has _ZERO for generation numbers?
>
> Answer: I see that we treat both _INFINITY (not in commit-graph) and _ZERO (in commit graph but not computed) as not-computed generation numbers. All right.
>
>> If a computed generation number would exceed GENERATION_NUMBER_MAX, then use GENERATION_NUMBER_MAX instead.
>
> All right, though I guess this would remain theoretical for a long while. We don't have any way of testing this, at least not without recompiling Git with a lower value of GENERATION_NUMBER_MAX -- which means not automatically, doesn't it?
>
>> Signed-off-by: Derrick Stolee <redacted>
>> ---
>>  commit-graph.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 45 insertions(+)
>>
>> diff --git a/commit-graph.c b/commit-graph.c
>> index 9ad21c3ffb..047fa9fca5 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -439,6 +439,9 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
>>  		else
>>  			packedDate[0] = 0;
>>
>> +		if ((*list)->generation != GENERATION_NUMBER_INFINITY)
>> +			packedDate[0] |= htonl((*list)->generation << 2);
>> +
>
> If we stumble upon a commit marked as "not in commit-graph" while writing the commit graph, it is a BUG(), isn't it? (Problem noticed by Junio.)

Since we are computing the values for all commits in the list, this condition is not important and will be removed.

> It is a bit strange to me that the code uses get_be32 for reading, but htonl for writing. Is Git tested on non-little-endian machines, like big-endian ppc64 or s390x, or on mixed-endian machines (or selectable-endian machines with data endianness set to non-little-endian, like ia64)? If not, could we use for example openSUSE Build Service (https://build.opensuse.org/) for this?

Since we are packing two values into 64 bits, I am using htonl() here to arrange the 30-bit generation number alongside the 34-bit commit date value, then writing with hashwrite(). The other 32-bit integers are written with hashwrite_be32() to avoid translating this data in-memory.
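
To make the layout concrete, here is a minimal, self-contained sketch of the packing described above. The helper names pack_generation_and_date() and unpack_generation_and_date() are hypothetical and are not part of commit-graph.c; the sketch assumes the generation number fits in 30 bits, the date fits in 34 bits, and htonl() is available from <arpa/inet.h> (POSIX). The reader decodes byte by byte, in the spirit of get_be32(), so the round trip does not depend on host endianness.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not in commit-graph.c): pack a 30-bit generation
 * number and a 34-bit commit date into 8 bytes in network byte order.
 * Word 0 holds the generation in its upper 30 bits and the top 2 bits
 * of the date in its lower 2 bits; word 1 holds the low 32 date bits.
 */
static void pack_generation_and_date(uint32_t generation, uint64_t date,
				     unsigned char out[8])
{
	uint32_t packed[2];

	assert(generation < (1u << 30));
	assert(date < ((uint64_t)1 << 34));

	packed[0] = htonl((uint32_t)((date >> 32) & 0x3));
	packed[0] |= htonl(generation << 2);
	packed[1] = htonl((uint32_t)(date & 0xffffffffu));
	memcpy(out, packed, sizeof(packed));
}

/* Read one big-endian 32-bit value from a byte buffer. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Hypothetical inverse of pack_generation_and_date(). */
static void unpack_generation_and_date(const unsigned char in[8],
				       uint32_t *generation, uint64_t *date)
{
	uint32_t hi = read_be32(in);
	uint32_t lo = read_be32(in + 4);

	*generation = hi >> 2;
	*date = ((uint64_t)(hi & 0x3) << 32) | lo;
}
```

Because htonl() only permutes bytes, OR-ing two htonl() results gives the same word as htonl() of the OR-ed values, which is why the two fields can be merged after conversion.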

>>  		packedDate[1] = htonl((*list)->date);
>>  		hashwrite(f, packedDate, 8);
>> @@ -571,6 +574,46 @@ static void close_reachable(struct packed_oid_list *oids)
>>  	}
>>  }
>>
>> +static void compute_generation_numbers(struct commit** commits,
>> +				       int nr_commits)
>> +{
>> +	int i;
>> +	struct commit_list *list = NULL;
>
> All right, commit_list will work as a stack.
>
>> +
>> +	for (i = 0; i < nr_commits; i++) {
>> +		if (commits[i]->generation != GENERATION_NUMBER_INFINITY &&
>> +		    commits[i]->generation != GENERATION_NUMBER_ZERO)
>> +			continue;
>
> All right, we consider _INFINITY and _ZERO as not computed. If the generation number is already computed (by 'recursion' or from the commit graph), we (re)use it. This means that generation number calculation is incremental, as intended -- good.
>
>> +
>> +		commit_list_insert(commits[i], &list);
>
> Start depth-first walks from the commits given.
>
>> +		while (list) {
>> +			struct commit *current = list->item;
>> +			struct commit_list *parent;
>> +			int all_parents_computed = 1;
>
> Here all_parents_computed is a boolean flag. I see that it is easier to start with the assumption that all parents will have computed generation numbers.
>
>> +			uint32_t max_generation = 0;
>
> The generation number value of 0 functions as a sentinel; generation numbers start from 1. Not that it matters much, as the lowest possible generation number is 1, and we could have started from that value.

Except that for a commit with no parents, we want it to receive generation number max_generation + 1 = 1, so this value of 0 is important.

>> +
>> +			for (parent = current->parents; parent; parent = parent->next) {
>> +				if (parent->item->generation == GENERATION_NUMBER_INFINITY ||
>> +				    parent->item->generation == GENERATION_NUMBER_ZERO) {
>> +					all_parents_computed = 0;
>> +					commit_list_insert(parent->item, &list);
>> +					break;
>
> If some parent doesn't have its generation number calculated, we add it to the stack (and break out of the loop because it is a depth-first walk), and mark this situation. All right.
>
>> +				} else if (parent->item->generation > max_generation) {
>> +					max_generation = parent->item->generation;
>
> Otherwise, update max_generation. All right.
>
>> +				}
>> +			}
>> +
>> +			if (all_parents_computed) {
>> +				current->generation = max_generation + 1;
>> +				pop_commit(&list);
>> +			}
>> +
>> +			if (current->generation > GENERATION_NUMBER_MAX)
>> +				current->generation = GENERATION_NUMBER_MAX;
>
> This conditional should be inside the all_parents_computed test, for example like this:
>
>   +			if (all_parents_computed) {
>   +				current->generation = max_generation + 1;
>   +				if (current->generation > GENERATION_NUMBER_MAX)
>   +					current->generation = GENERATION_NUMBER_MAX;
>   +
>   +				pop_commit(&list);
>   +			}
>
> (Noticed by Junio.)
>
> Sidenote: when we revisit the commit, returning from the depth-first walk of one of its parents, we calculate max_generation from scratch again. This does not matter for performance, as it's just data access and calculating a maximum - any workaround to avoid restarting those calculations would take more time and memory. And it's simple.
>
>> +		}
>> +	}
>> +}
>> +
>>  void write_commit_graph(const char *obj_dir,
>>  			const char **pack_indexes,
>>  			int nr_packs,
>> @@ -694,6 +737,8 @@ void write_commit_graph(const char *obj_dir,
>>  	if (commits.nr >= GRAPH_PARENT_MISSING)
>>  		die(_("too many commits to write graph"));
>>
>> +	compute_generation_numbers(commits.list, commits.nr);
>> +
>
> Nice and simple. All right.
>
> I guess that we pass "struct commit** commits.list" and "int commits.nr" to compute_generation_numbers(), rather than "struct packed_commit_list commits", to keep the function nice and generic?

Good catch. There is no reason to not use packed_commit_list here.
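
For readers following the walk-through, here is a standalone sketch of the same iterative depth-first computation on a toy commit type, with the clamp moved inside the all_parents_computed branch as suggested above. Everything prefixed toy_ or TOY_ is hypothetical: the struct, the constant values, and the array-based stack stand in for struct commit, the GENERATION_NUMBER_* macros, and commit_list, and are assumptions made only for illustration. The caller is assumed to supply a stack large enough to hold the longest chain of uncomputed commits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the GENERATION_NUMBER_* macros. */
#define TOY_GENERATION_INFINITY 0xFFFFFFFFu
#define TOY_GENERATION_ZERO 0u
#define TOY_GENERATION_MAX 0x3FFFFFFFu
#define TOY_MAX_PARENTS 2

struct toy_commit {
	uint32_t generation;
	int nr_parents;
	struct toy_commit *parents[TOY_MAX_PARENTS];
};

/* Both _INFINITY and _ZERO count as "not yet computed". */
static int toy_is_computed(const struct toy_commit *c)
{
	return c->generation != TOY_GENERATION_INFINITY &&
	       c->generation != TOY_GENERATION_ZERO;
}

static void toy_compute_generations(struct toy_commit **commits, int nr_commits,
				    struct toy_commit **stack, int stack_cap)
{
	int i;

	for (i = 0; i < nr_commits; i++) {
		int depth = 0;

		if (toy_is_computed(commits[i]))
			continue;

		assert(stack_cap > 0);
		stack[depth++] = commits[i];
		while (depth > 0) {
			struct toy_commit *current = stack[depth - 1];
			/* 0 so that a root commit ends up with 0 + 1 = 1 */
			uint32_t max_generation = 0;
			int all_parents_computed = 1;
			int p;

			for (p = 0; p < current->nr_parents; p++) {
				struct toy_commit *parent = current->parents[p];

				if (!toy_is_computed(parent)) {
					/* Descend into this parent first. */
					all_parents_computed = 0;
					assert(depth < stack_cap);
					stack[depth++] = parent;
					break;
				}
				if (parent->generation > max_generation)
					max_generation = parent->generation;
			}

			if (all_parents_computed) {
				current->generation = max_generation + 1;
				if (current->generation > TOY_GENERATION_MAX)
					current->generation = TOY_GENERATION_MAX;
				depth--; /* pop: this commit is done */
			}
		}
	}
}
```

On a diamond-shaped history (one root, two children of the root, and a merge of those two), this sketch assigns 1 to the root, 2 to each child, and 3 to the merge. A commit whose parent already sits at TOY_GENERATION_MAX stays clamped at that value.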

>>  	graph_name = get_commit_graph_filename(obj_dir);
>>  	fd = hold_lock_file_for_update(&lk, graph_name, 0);
>
> Best,