Re: [PATCH 0/6] odb: track commit graphs via object source

From: Patrick Steinhardt <hidden>
Date: 2025-09-08 11:17:52

On Fri, Sep 05, 2025 at 02:29:50PM -0400, Derrick Stolee wrote:

On 9/4/2025 7:12 PM, Junio C Hamano wrote:

Patrick Steinhardt [off-list ref] writes:

Commit graphs are currently stored on the object database level. This
doesn't really make much sense conceptually, given that commit graphs
are specific to one object source. Furthermore, with the upcoming
pluggable object database effort, an object source's backend may not
even have a commit graph in the first place but store that information
in a different format altogether.

This patch series prepares for that by moving the commit graph from
`struct object_database` into `struct odb_source`.

Hmph, I am finding the above hard to agree with at the conceptual
level.  In some future, we may use multiple object stores in a
single repository.  Perhaps we would be storing older parts of
history in semi-online storage while newer parts are stored in
readily available storage.  But the side data structure that allows
us to quickly learn who the parents of a commit are, without having
to go to the object store in order to parse the actual commit
object, can be stored for the entire history if we wanted to, or more
recent part of the history but not limited to the "readily available
storage" part.  IOW, where the boundary between the older and the
newer parts of the history lies and which commits the commit graph
covers should be pretty much independent.

So moving from object_database (i.e. the whole world) to individual
odb_source (i.e. where one particular subset of the history is
stored) feels totally backwards to me.  Surely, a commit graph
file may be defined over a set of packfiles and remaining loose
object files, but it is not like an instance of the commit-graph
file is tied to packfiles in the sense that it uses the index into
some packfile instead of the actual object names to refer to
commits, or anything like that (this is quite different from other
files that are very specific to a single object store, like midx
that is tied to the packfiles it describes).

This is an interesting aspect to things, where the commit-graph file
is a "structured cache" of certain commit information. It happens to
be located within the object stores (either local or in an alternate)
but is conceptually different in a few ways.

The biggest difference is that you can only open one commit-graph
(or chain of commit-graphs). Having multiple files across different
object stores will not accumulate additional context. Instead, we
have a "first one wins" approach.

This does seem to be something that you are attempting to change
by including the ability to load a commit-graph for each odb (and
closing them in sequence as we close a repo).

So in this sense, the commit-graph lives at the repository level,
not an object store level. When doing I/O to write or read a graph,
we use a specific object store at a time.

The other direction to consider is what context we have when we
interact with a commit-graph. We generally are parsing commits from
a repository or loading Bloom filter data during file history walks.
In neither case can we predict which object store
will "own" the commit we are inspecting, so it wouldn't make sense
to restrict callers to something like odb_parse_commit() instead of
repo_parse_commit().

With this in mind, here are my big-picture thoughts:

1. Patches 1-5 are great. Nice cleanups.

2. Some of Patch 6 is great, including having the I/O methods use
   an odb_source to help focus the specific location of the files
   being read or written. However, the movement of the struct into
   the odb_source makes less sense and should still exist at the
   object_database level.

I (probably unsurprisingly :)) don't quite agree with this.

Let's take a step back: why does the commit-graph exist in the first
place? It basically provides a caching mechanism to efficiently return
information that is otherwise more expensive to obtain:

  - It contains a cached representation of the graph so that we don't
    have to parse each commit from the object database.

  - It encodes generation numbers.

  - It contains Bloom filters.

All of which makes sense with the current design of our object storage
format, because obtaining this information can be quite expensive. But
let's consider a different world where we for example store objects in a
proper database:

  - This database may have an efficient way to compute generation
    numbers on the fly, either when reading an object or when writing it
    to disk. We cannot currently store that information in the packfile,
    so it needs to be stored out-of-band. But with a database
    there is no reason why we couldn't immediately compute and store the
    generation number on each insert.

  - This database may have an efficient way to store Bloom filters next
    to a specific commit directly, without requiring a separate file.

  - This database may be distributed. So why should every client now
    have to recompute a commit graph if we can instead store the data in
    the database and thus have it accessible to all clients thereof?

  - It may be _less_ efficient to use the commit graph to access data
    compared to what that database can provide.
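
To make the first point a bit more concrete, here is a rough sketch of
what "compute the generation number on each insert" could look like. It
uses the topological-level definition (root commits get 1, every other
commit gets one more than the largest generation among its parents).
`struct db_commit` and `db_commit_set_generation()` are made up for
illustration and do not exist in Git, and the sketch assumes that all
parents have already been inserted, which a real backend could not take
for granted:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical row for a commit in a database-backed object source. */
struct db_commit {
	const struct db_commit **parents; /* already-inserted parent rows */
	size_t nr_parents;
	uint32_t generation;              /* 0 means "not computed yet" */
};

/*
 * Compute the topological level at insert time: a root commit gets 1,
 * any other commit gets one more than the largest generation among its
 * parents.
 */
static void db_commit_set_generation(struct db_commit *commit)
{
	uint32_t max_parent = 0;

	for (size_t i = 0; i < commit->nr_parents; i++) {
		const struct db_commit *parent = commit->parents[i];

		/* Assumption: parents were inserted (and numbered) first. */
		assert(parent && parent->generation > 0);
		if (parent->generation > max_parent)
			max_parent = parent->generation;
	}

	commit->generation = max_parent + 1;
}
```

With something along those lines the number would simply be another
column that is read together with the commit, instead of living in a
separate file.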

So I would claim that the commit graph is specifically tied to the
actual storage format of objects, and it's not at all obvious that it
would need to exist if we had a different storage format.

The goal of this patch series is thus explicitly _not_ to allow loading
one commit graph per object source. In fact, the refactorings I did
ensure that we still only ever load a single commit graph.

Instead, the goal is to allow each object source to decide for itself
how this additional information is to be stored and retrieved. This
_may_ be a commit graph if that makes sense for a particular storage
format. But it may just as well _not_ be a commit graph, as other
storage formats may have way better solutions for making the commit
graph information accessible.
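
As a purely illustrative sketch of that idea (none of these names exist
in Git, and this is not what the series implements), one could imagine
each source carrying an optional callback that answers "what is the
generation number of this commit?". A files-style source would answer
from its commit graph, while another backend could answer from wherever
it keeps that data, or not at all:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct sketch_source;

/* Hypothetical per-backend hook: 0 on success, -1 if there is no cheap answer. */
struct sketch_source_ops {
	int (*read_generation)(struct sketch_source *src,
			       const char *oid, uint32_t *out);
};

struct sketch_source {
	const struct sketch_source_ops *ops;
	void *data;                 /* backend-private state */
	struct sketch_source *next; /* next source, e.g. an alternate */
};

/* Toy stand-in for a commit graph: a flat table of (oid, generation). */
struct graph_entry {
	const char *oid;
	uint32_t generation;
};

struct graph_table {
	const struct graph_entry *entries;
	size_t nr;
};

/* A files-style backend answers from its (toy) commit graph, if it has one. */
static int files_read_generation(struct sketch_source *src,
				 const char *oid, uint32_t *out)
{
	const struct graph_table *graph = src->data;

	if (!graph)
		return -1;
	for (size_t i = 0; i < graph->nr; i++) {
		if (!strcmp(graph->entries[i].oid, oid)) {
			*out = graph->entries[i].generation;
			return 0;
		}
	}
	return -1;
}

static const struct sketch_source_ops files_backend_ops = {
	.read_generation = files_read_generation,
};

/*
 * Ask each source in turn. On -1 the caller would fall back to parsing
 * the commit object itself.
 */
static int sketch_read_generation(struct sketch_source *sources,
				  const char *oid, uint32_t *out)
{
	assert(oid && out);
	for (struct sketch_source *s = sources; s; s = s->next)
		if (s->ops && s->ops->read_generation &&
		    !s->ops->read_generation(s, oid, out))
			return 0;
	return -1;
}
```

Callers would keep using a repository-level entry point and would not
need to know which source ends up answering.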

Patrick
