Re: [PATCH] commit-graph: retain commit slab when closing NULL commit_graph
From: Taylor Blau <hidden>
Date: 2024-01-05 17:07:46
On Fri, Jan 05, 2024 at 12:41:42AM -0500, Jeff King wrote:
This fixes a regression introduced in ac6d45d11f (commit-graph: move
slab-clearing to close_commit_graph(), 2023-10-03), in which running:
git -c fetch.writeCommitGraph=true fetch --recurse-submodules
multiple times in a freshly cloned repository causes a segfault. What
happens in the second (and subsequent) runs is this:
1. We make a "struct commit" for any ref tips which we're storing
(even if we already have them, they still go into FETCH_HEAD).
Because the first run will have created a commit graph, we'll find
those commits in the graph.
The commit struct is therefore created with a NULL "maybe_tree"
entry, because we can load its oid from the graph later. But to do
that we need to remember that we got the commit from the graph,
which is recorded in a global commit_graph_data_slab object.
2. Because we're using --recurse-submodules, we'll try to fetch each
of the possible submodules. That implies creating a separate
"struct repository" in-process for each submodule, which will
require a later call to repo_clear().
The call to repo_clear() calls raw_object_store_clear(), which in
turn calls close_object_store(), which in turn calls
close_commit_graph(). And the latter frees the commit graph data
slab.
3. Later, when trying to write out a new commit graph, we'll ask for
their tree oid via get_commit_tree_oid(), which will see that the
object is parsed but with a NULL maybe_tree field. We'd then
usually pull it from the graph file, but because the slab was
cleared, we don't realize that we can do so! We end up returning
NULL and segfaulting.
(It seems questionable that we'd write a graph entry for such a
commit anyway, since we know we already have one. I didn't
double-check, but that may simply be another side effect of having
cleared the slab).Yeah, I agree that we should not be re-writing a commit-graph entry that already exists in an earlier (non-removed) layer of the commit-graph. I haven't thought through all of the details here, but that sounds like a potentially dangerous thing to be doing.
So that fixes the regression in the least-risky way possible.
Thanks for the detailed explanation. I think that the fix presented here is a reasonable approach, and is worthwhile to pick up.
I do think there's some fragility here that we might want to follow up on. We have a global commit_graph_data_slab that contains graph positions, and our global commit structs depend on the that slab remaining valid. But close_commit_graph() is just about closing _one_ object store's graph. So it's dangerous to call that function and clear the slab without also throwing away any "struct commit" we might have parsed that depends on it.
I definitely agree that there seems like a high risk of another potentially-dangerous bug happening as a result of this area. One thing that sticks out to me is that we have a scope mismatch between the commit structs, the raw_object_store, and the commit slabs. Slabs are tied to the lifetime of the raw_object_store, but the commit objects are tied to the global lifetime of the process. I wonder if it would make sense to have a slab per object-store, and require that you pass both the commit *and* the object-store you're looking it up in as arguments to any slab-related functions. I dunno. I have not put a ton of thought into that ^^ approach either, so let me know if it seems obviously bogus to you or if this is worth looking into further. Thanks, Taylor