Re: [PATCH 0/6] odb: track commit graphs via object source
From: Patrick Steinhardt <hidden>
Date: 2025-09-08 11:17:52
On Fri, Sep 05, 2025 at 02:29:50PM -0400, Derrick Stolee wrote:
On 9/4/2025 7:12 PM, Junio C Hamano wrote:quoted
Patrick Steinhardt [off-list ref] writes:quoted
commit graphs are currently stored on the object database level. This doesn't really make much sense conceptually, given that commit graphs are specific to one object source. Furthermore, with the upcoming pluggable object database effort, an object source's backend may not evene have a commit graph in the first place but store that information in a different format altogether. This patch series prepares for that by moving the commit graph from `struct object_database` into `struct odb_source`.Hmph, I am finding the above hard to agree with at the conceptual level. In some future, we may use multiple object stores in a single repository. Perhaps we would be storing older parts of history in semi-online storage while newer parts are stored in readily available storage. But the side data structure that allows us to quickly learn who are parents of one commit is without having to go to the object store in order to parse the actualy commit object can be stored for the entire history if we wanted to, or more recent part of the history but not limited to the "readily available storage" part. IOW, where the boundary between the older and the newer parts of the history lies and which commits the commit graph covers should be pretty much independent. So moving from object_database (i.e. the whole world) to individual odb_source (i.e. where one particular subset of the history is stored) feels like totally backwards to me. Surely, a commit graph file may be defined over a set of packfiles and remaining loose object files, but it is not like an instance of the commit-graph file is tied to packfiles in the sense that it uses the index into some packfile instead of the actual object names to refer to commits, or anything like that (this is quite different from other files that are very specific to a single object store, like midx that is tied to the packfiles it describes).This is an interesting aspect to things, where the commit-graph file is a "structured cache" of certain commit information. It happens to be located within the object stores (either local or in an alternate) but is conceptually different in a few ways. The biggest difference is that you can only open one commit-graph (or chain of commit-graphs). Having multiple files across different object stores will not accumulate additional context. Instead, we have a "first one wins" approach. This does seem to be something that you are attempting to change by including the ability to load a commit-graph for each odb (and closing them in sequence as we close a repo). So in this sense, the commit-graph lives at the repository level, not an object store level. When doing I/O to write or read a graph, we use a specific object store at a time. The other direction to consider is what context we have when we interact with a commit-graph. We generally are parsing commits from a repository or loading Bloom filter data during file history walks. Each of these do not have a predictable nature of which object store will "own" the commit we are inspecting, so it wouldn't make sense to restrict things like odb_parse_commit() over repo_parse_commit(). With these thoughts in mind, I have these big-picture thoughts: 1. Patches 1-5 are great. Nice cleanups. 2. Some of Patch 6 is great, including having the I/O methods use an odb_source to help focus the specific location of the files being read or written. However, the movement of the struct into the odb_source makes less sense and should still exist at the object_database level.
I (probably unsurprisingly :)) don't quite agree with this.
Let's take a step back: why does the commit-graph exist in the first
place? It basically provides a caching mechanism to efficiently return
information that is otherwise more expensive to obtain:
- It contains a cached representation of the graph so that we don't
have to parse each commit from the object database.
- It encodes generation numbers.
- It contains bloom filters.
All of which makes sense with the current design of our object storage
format, because obtaining this information can be quite expensive. But
let's consider a different world where we for example store objects in a
proper database:
- This database may have an efficient way to compute generation
numbers on the fly, either when reading an object or when writing it
to disk. We cannot currently store that information in the packfile
right now, so it needs to be stored out-of-band. But with a database
there is no reason why we couldn't immediately compute and store the
generation number on each insert.
- This database may have an efficient way to store bloom filters next
to a specific commit directly, without requiring a separate file.
- This database may be distributed. So why should every client now
have to recompute a commit graph if we can instead store the data in
the database and thus have it accessible to all clients thereof?
- It may be _less_ efficient to use the commit graph to access data
compared to what that database can provide.
So I would claim that the commit graph is specifically tied to the
actual storage format of objects, and it's not at all obvious that it
would need to exist if we had a different storage format.
The goal of this patch series is thus explicitly _not_ to allow loading
one commit graph per object source. In fact, the refactorings I did
ensure that we still only ever load a single commit graph.
Instead, the goal is to allow each object source to decide for itself
how this additional information is to be stored and retrieved. This
_may_ be a commit graph if that makes sense for a particular storage
format. But it may just as well _not_ be a commit graph, as other
storage formats may have way better solutions for making the commit
graph information accessible.
Patrick