Thread: 26 messages, 5 authors, 2025-11-12

Re: [PATCH] doc: add a explanation of Git's data model

From: Julia Evans <hidden>
Date: 2025-10-07 18:55:58


On Tue, Oct 7, 2025, at 10:32 AM, Patrick Steinhardt wrote:
On Fri, Oct 03, 2025 at 05:34:36PM +0000, Julia Evans via GitGitGadget wrote:
quoted
diff --git a/Documentation/gitdatamodel.adoc b/Documentation/gitdatamodel.adoc
new file mode 100644
index 0000000000..4b2cb167dc
--- /dev/null
+++ b/Documentation/gitdatamodel.adoc
@@ -0,0 +1,226 @@
+gitdatamodel(7)
+===============
+
+NAME
+----
+gitdatamodel - Git's core data model
+
+DESCRIPTION
+-----------
+
+It's not necessary to understand Git's data model to use Git, but it's
+very helpful when reading Git's documentation so that you know what it
+means when the documentation says "object" "reference" or "index".
There's a missing comma after "object".
Will fix.
quoted
+
+Git's core operations use 4 kinds of data:
+
+1. <<objects,Objects>>: commits, trees, blobs, and tag objects
+2. <<references,References>>: branches, tags,
+   remote-tracking branches, etc
+3. <<index,The index>>, also known as the staging area
+4. <<reflogs,Reflogs>>
This list makes sense to me. There are of course more data structures in
Git, but all the other data structures shouldn't really matter to users
at all as they are mostly caches or internal details of the on-disk
format.

There's potentially one exception though, namely the Git configuration.
I'd claim that Git "uses" the Git configuration similarly to how it uses
the others, but I get why it's not explicitly mentioned here.
quoted
+[[objects]]
+OBJECTS
+-------
+
+Commits, trees, blobs, and tag objects are all stored in Git's object database.
+Every object has:
+
+1. an *ID*, which is the SHA-1 hash of its contents.
I think this needs to be adapted to not single out SHA-1 as the only
hashing algorithm. We already support SHA-256, so we should definitely
say that the algorithm can be swapped. Maybe something like:

  An *object ID*, which is the cryptographic hash of its contents. By
  default, Git uses SHA-1 as object hash, but alternative hashes like
  SHA-256 are supported.
Makes sense. I might just say "cryptographic hash of its type and contents"
and leave it at that. I'm not sure it's worth getting into details
of the exact hash function.
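
For illustration, here is a minimal Python sketch of what "hash of its type and contents" means in practice. It assumes the loose-object convention of hashing a `<type> <size>\0` header followed by the contents; the function name `git_object_id` is made up for this example and is not part of Git. The size in the header is also why an object's size can be treated as implied by its contents.

```python
import hashlib

def git_object_id(obj_type: str, contents: bytes, algo: str = "sha1") -> str:
    """Hash '<type> <size>' + NUL + contents (assumed loose-object layout)."""
    header = f"{obj_type} {len(contents)}".encode("ascii") + b"\0"
    return hashlib.new(algo, header + contents).hexdigest()

# The same bytes hash differently depending on the object type.
print(git_object_id("blob", b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(git_object_id("tree", b""))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904

# Swapping the hash function only changes the length and value of the ID.
print(len(git_object_id("blob", b"", algo="sha256")))  # 64
```

The two SHA-1 values above are the familiar IDs of the empty blob and the empty tree, which is a quick way to check that the type really is part of what gets hashed.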
quoted
+  It's fast to look up a Git object using its ID.
+  The ID is usually represented in hexadecimal, like
+  `1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`.
+2. a *type*. There are 4 types of objects:
+   <<commit,commits>>, <<tree,trees>>, <<blob,blobs>>,
+   and <<tag-object,tag objects>>.
+3. *contents*. The structure of the contents depends on the type.
Nit: every object also has an object size. Not sure though whether it's
fine to imply that with "contents".
I think it is.
quoted
+Once an object is created, it can never be changed.
+Here are the 4 types of objects:
+
+[[commit]]
+commits::
+    A commit contains:
++
+1. Its *parent commit ID(s)*. The first commit in a repository has 0 parents,
+  regular commits have 1 parent, merge commits have 2+ parents
I'd say "at least two parents" instead of "2+ parents".
quoted
+2. A *commit message*
+3. All the *files* in the commit, stored as a *<<tree,tree>>*
+4. An *author* and the time the commit was authored
+5. A *committer* and the time the commit was committed
++
+Here's how an example commit is stored:
++
+----
+tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a
+parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647
+author Maya [off-list ref] 1759173425 -0400
+committer Maya [off-list ref] 1759173425 -0400
+
+Add README
+----
In practice, commits can have other headers that are ignored by Git. But
that's certainly not part of Git's core data model, so I don't think we
should mention that here.
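
As a rough illustration of the layout in the example commit above (headers, a blank line, then the message), here is a small Python sketch. The function `parse_commit` and the address `maya@example.com` are hypothetical, and the sketch deliberately ignores multi-line headers such as signatures.

```python
def parse_commit(raw: str) -> dict:
    """Split a commit's text into its headers and its message."""
    head, _, message = raw.partition("\n\n")
    commit = {"parents": [], "message": message}
    for line in head.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            commit["parents"].append(value)  # zero, one, or several parents
        else:
            commit[key] = value              # tree, author, committer, others
    return commit

# Same shape as the example in the patch; the email address is a placeholder.
raw = (
    "tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a\n"
    "parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647\n"
    "author Maya <maya@example.com> 1759173425 -0400\n"
    "committer Maya <maya@example.com> 1759173425 -0400\n"
    "\n"
    "Add README\n"
)
commit = parse_commit(raw)
print(commit["tree"])
print(commit["parents"])
print(commit["message"])
```

Headers the sketch does not know about simply end up as extra keys, which mirrors the point that additional headers exist without being part of the core model.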
quoted
+Like all other objects, commits can never be changed after they're created.
+For example, "amending" a commit with `git commit --amend` creates a new commit.
+The old commit will eventually be deleted by `git gc`.
If we mention git-gc(1) I think it would make sense to use
`linkgit:git-gc[1]` instead to provide a link to its man page.
Agreed.
quoted
+[[tree]]
+trees::
+    A tree is how Git represents a directory. It lists, for each item in
+    the tree:
++
+1. The *permissions*, for example `100644`
I think we should rather call these "mode bits". These bits are
permissions indeed when you have a blob, but for subtrees, symlinks and
submodules they aren't.
I think it's a bit strange to call them mode bits since I thought they were stored
as ASCII strings and it's basically an enum of 5 options, but I see your point.
I think "file mode" will work and that's used elsewhere.

I wonder if it would make sense to list all of the possible file modes if
this isn't documented anywhere else. My impression is that it's a short
list and that it's unlikely to change much in the future.

And listing them all might make it more clear that Git's file modes don't
have much in common with Unix file modes.
I looked for where this is documented and it looks like the only place is
in `man git-fast-import`. That man page says that there are just 5 options
(040000, 160000, 100644, 100755, 120000).
quoted
+2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory),
+  or <<commit,`commit`>> (a Git submodule)
There's also symlinks.
I created a test symlink and it looks like symlinks are stored as type "blob".
I might say which type corresponds to which file mode,
though I'm not sure what type corresponds to the "gitlink" mode (commit?).

I think these are the 5 modes and what they mean / what type they
should have. Not sure about the gitlink mode though.

  - `100644`: regular file (with type `blob`)
  - `100755`: executable file (with type `blob`)
  - `120000`: symbolic link (with type `blob`)
  - `040000`: directory (with type `tree`)
  - `160000`: gitlink, for use with submodules (with type `commit`)
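
The mode list above is small enough to treat as a lookup table. The following Python sketch does that and checks a tree listing line against it. The names `FILE_MODES` and `parse_tree_line` are invented for the example, and pairing the gitlink mode with type `commit` follows the guess above, which matches what `git ls-tree` prints for submodule entries as far as I know.

```python
# Mode -> (object type, meaning), following the five modes listed above.
FILE_MODES = {
    "100644": ("blob", "regular file"),
    "100755": ("blob", "executable file"),
    "120000": ("blob", "symbolic link"),
    "040000": ("tree", "directory"),
    "160000": ("commit", "gitlink (submodule)"),
}

def parse_tree_line(line: str) -> dict:
    """Parse '<mode> <type> <object ID> <filename>' and check mode against type."""
    mode, obj_type, object_id, name = line.split(None, 3)
    if mode not in FILE_MODES:
        raise ValueError(f"unknown file mode: {mode}")
    expected_type, kind = FILE_MODES[mode]
    if obj_type != expected_type:
        raise ValueError(f"mode {mode} expects type {expected_type}, got {obj_type}")
    return {"mode": mode, "type": obj_type, "id": object_id, "name": name, "kind": kind}

# The two entries from the example tree in the patch.
for line in (
    "100644 blob 8728a858d9d21a8c78488c8b4e70e531b659141f README.md",
    "040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src",
):
    entry = parse_tree_line(line)
    print(entry["name"], "is a", entry["kind"])
```

Writing it as a table makes it visible that the file mode works more like an enum than like a set of Unix permission bits.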
quoted
+3. The *object ID*
+4. The *filename*
++
+For example, this is how a tree containing one directory (`src`) and one file
+(`README.md`) is stored:
++
+----
+100644 blob 8728a858d9d21a8c78488c8b4e70e531b659141f README.md
+040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src
+----
++
+*NOTE:* The permissions are in the same format as UNIX permissions, but
+the only allowed permissions for files (blobs) are 644 and 755.
+
+[[blob]]
+blobs::
+    A blob is how Git represents a file. A blob object contains the
+    file's contents.
++
+Storing a new blob for every new version of a file can get big, so
+`git gc` periodically compresses objects for efficiency in `.git/objects/pack`.
I would claim that it's not necessary to mention object compression.
This should be a low-level detail that users don't ever have to worry
about. Furthermore, packing objects isn't only relevant in the context
of blobs: trees, for example, also tend to compress very well, as there
are typically only small incremental updates to trees.
I discussed why I think this is important in another reply,
https://lore.kernel.org/all/51e0a55c-1f1d-4cae-9459-8c2b9220e52d@app.fastmail.com/,
and will paste what I said here. I'll think about this more though.

paste follows:

That's true! The reason I think this is important to mention is that I find
that people often "reject" information that they find implausible, even
if it comes from a credible source. ("that can't be true! I must be
not understanding correctly. Oh well, I'll just ignore that!")

I sometimes hear from users that "commits can't be snapshots", because
it would take up too much disk space to store every version of
every commit. So I find that sometimes explaining a little bit about the
implementation can make the information more memorable.

Certainly I'm not able to remember details that don't make sense
with my mental model of how computers work and I don't expect other
people to either, so I think it's important to give an explanation that
handles the biggest "objections".
quoted
+[[tag-object]]
+tag objects::
+    Tag objects (also known as "annotated tags") contain:
++
+1. The *tagger* and tag date
+2. A *tag message*, similar to a commit message
+3. The *ID* of the object (often a commit) that they reference
They can also be signed, if we want to mention that.
I guess that's true for commit objects too. Not sure whether to
mention it either, can add it if others think it's important.
quoted
+[[references]]
+REFERENCES
+----------
+
+References are a way to give a name to a commit.
+It's easier to remember "the changes I'm working on are on the `turtle`
+branch" than "the changes are in commit bb69721404348e".
+Git often uses "ref" as shorthand for "reference".
+
+References that you create are stored in the `.git/refs` directory,
+and Git has a few special internal references like `HEAD` that are stored
+in the base `.git` directory.
This isn't true anymore with the introduction of the reftable backend,
which is slated to become the default backend. I'd argue that this is
another implementation detail that the user shouldn't have to worry
about.
Makes sense, will fix. (as well as other references to the .git prefix and
"subdirectories").
quoted
+References can either be:
+
+1. References to an object ID, usually a <<commit,commit>> ID
+2. References to another reference. This is called a "symbolic reference".
+
+Git handles references differently based on which subdirectory of
+`.git/refs` they're stored in.
So instead of saying "subdirectory", I'd rather say "reference
hierarchy".

In general, I think we should explain that references are laid out
in a hierarchy. This is somewhat obvious with the "files" backend, as we
use directories there. But as we move on to the "reftable" backend this
may become less obvious over time.
That makes sense.
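
One way to picture a "reference hierarchy" without mentioning directories is a plain mapping from hierarchical names to values. The Python sketch below is only a model: `REFS`, `resolve`, and `refs_under` are hypothetical names, the object IDs are borrowed from the examples in the patch, and neither the "files" nor the "reftable" backend is claimed to work this way internally.

```python
# Each name maps to an object ID or to a symbolic reference ("ref: <name>").
REFS = {
    "HEAD": "ref: refs/heads/main",
    "refs/heads/main": "4ccb6d7b8869a86aae2e84c56523f8705b50c647",
    "refs/heads/turtle": "1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a",
    "refs/tags/v1.0": "4ccb6d7b8869a86aae2e84c56523f8705b50c647",
    "refs/remotes/origin/HEAD": "ref: refs/remotes/origin/main",
    "refs/remotes/origin/main": "4ccb6d7b8869a86aae2e84c56523f8705b50c647",
}

def resolve(name: str, refs: dict = REFS, max_depth: int = 5) -> str:
    """Follow symbolic references until an object ID is reached."""
    for _ in range(max_depth):
        value = refs[name]
        if not value.startswith("ref: "):
            return value
        name = value[len("ref: "):]
    raise ValueError("too many levels of symbolic references")

def refs_under(prefix: str, refs: dict = REFS) -> list:
    """List the references in one part of the hierarchy, e.g. 'refs/heads/'."""
    return sorted(name for name in refs if name.startswith(prefix))

print(resolve("HEAD"))
print(refs_under("refs/heads/"))
print(resolve("refs/remotes/origin/HEAD"))
```

The prefix of a name (`refs/heads/`, `refs/tags/`, `refs/remotes/`) is what determines how the reference is treated, regardless of how the backend stores it.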
quoted
+[[tag]]
+tags: `.git/refs/tags/<name>`::
+    A tag is a name for a commit ID, tag object ID, or other object ID.
+    Tags are stored in the `refs/tags/` directory.
++
+Even though branches and commits are both "a name for a commit ID", Git
+treats them very differently.
+Branches are expected to be regularly updated as you work on the branch,
+but it's expected that a tag will never change after you create it.
This sounds a bit like users need to update the branch themselves. How
about this instead:

    Even though branches and tags are both "a name for a commit ID", Git
    treats them very differently:

        - Branches can be checked out directly. If so, creating a new
          commit will automatically update the checked-out branch to
          point to the new commit.

        - Tags cannot be checked out directly and don't move when
          creating a new commit. Instead, one can only check out the
          commit that a tag points to. This is called "detached
          HEAD", and the effect is that a new commit will not update 
I think mentioning that branches can be checked out and that tags can't
is a good idea.
quoted
+[[HEAD]]
+HEAD: `.git/HEAD`::
+    `HEAD` is where Git stores your current <<branch,branch>>.
+    `HEAD` is normally a symbolic reference to your current branch, for
+    example `ref: refs/heads/main` if your current branch is `main`.
+    `HEAD` can also be a direct reference to a commit ID,
+    that's called "detached HEAD state".
+
+[[remote-tracking-branch]]
+remote tracking branches: `.git/refs/remotes/<remote>/<branch>`::
+    A remote-tracking branch is a name for a commit ID.
+    It's how Git stores the last-known state of a branch in a remote
+    repository. `git fetch` updates remote-tracking branches. When
+    `git status` says "you're up to date with origin/main", it's looking at
+    this.
This misses "refs/remotes/<remote>/HEAD". This reference is a symbolic
reference that indicates the default branch on the remote side.
Is "refs/remotes/<remote>/HEAD" a remote-tracking branch?
I've never thought about that reference and I'm not sure what to call it.
quoted
+[[other-refs]]
+Other references::
+    Git tools may create references in any subdirectory of `.git/refs`.
+    For example, linkgit:git-stash[1], linkgit:git-bisect[1],
+    and linkgit:git-notes[1] all create their own references
+    in `.git/refs/stash`, `.git/refs/bisect`, etc.
+    Third-party Git tools may also create their own references.
++
+Git may also create references in the base `.git` directory
+other than `HEAD`, like `ORIG_HEAD`.
Let's mention that such references are typically spelt all-uppercase
with underscores between. You shouldn't ever create a reference that is
for example called ".git/foo".

We enforce this restriction only inconsistently, but I don't think that
should keep us from spelling out the common rule.
That makes sense. I'm also not sure whether third-party
Git tools are "supposed" to create references outside of "refs/",
or whether that's common. 
quoted
+*NOTE:* As an optimization, references may be stored as packed
+refs instead of in `.git/refs`. See linkgit:git-pack-refs[1].
I'd drop this note. It's an internal implementation detail and only true
for the "files" backend. The "reftable" backend stores references quite
differently and doesn't really "pack" references.
quoted
+[[index]]
+THE INDEX
+---------
+
+The index, also known as the "staging area", contains the current staged
Honestly, I always forget which of these two nouns we are supposed to
use nowadays. I think consensus was to use "index" and avoid using
"staging area"? Not sure though, but I think we should only mention
one of these.
quoted
+version of every file in your Git repository. When you commit, the files
+in the index are used as the files in the next commit.
+
+Unlike a tree, the index is a flat list of files.
+Each index entry has 4 fields:
+
+1. The *permissions*
+2. The *<<blob,blob>> ID* of the file
+3. The *filename*
+4. The *number*. This is normally 0, but if there's a merge conflict
I think we don't call this "number", but "stage".
Thanks, I see that it's sometimes called "stage number" which is a little
easier to search for so I'll call it that.
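
To make the stage number concrete, here is a small Python sketch that groups `git ls-files --stage` style lines by filename and reports which paths have entries at a non-zero stage. The function names are hypothetical, the object IDs in the conflicted entries are obvious placeholders, and the stage meanings (1 for the common ancestor, 2 for "ours", 3 for "theirs") are the usual convention during a merge conflict.

```python
from collections import defaultdict

STAGE_MEANING = {0: "no conflict", 1: "common ancestor", 2: "ours", 3: "theirs"}

def parse_index(listing: str) -> dict:
    """Group '<mode> <blob ID> <stage> <filename>' lines by filename."""
    entries = defaultdict(dict)
    for line in listing.splitlines():
        mode, blob_id, stage, path = line.split(None, 3)
        entries[path][int(stage)] = (mode, blob_id)
    return dict(entries)

def conflicted(entries: dict) -> list:
    """Paths that have at least one entry with a non-zero stage number."""
    return sorted(p for p, stages in entries.items() if any(s > 0 for s in stages))

# README.md is taken from the patch; the src/hello.py IDs are placeholders.
listing = "\n".join([
    "100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0 README.md",
    "100644 " + "a" * 40 + " 1 src/hello.py",
    "100644 " + "b" * 40 + " 2 src/hello.py",
    "100644 " + "c" * 40 + " 3 src/hello.py",
])
entries = parse_index(listing)
print(conflicted(entries))
for stage in sorted(entries["src/hello.py"]):
    print(stage, STAGE_MEANING[stage])
```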
quoted
+   there can be multiple versions (with numbers 0, 1, 2, ..)
+   of the same filename in the index.
+
+It's extremely uncommon to look at the index directly: normally you'd
+run `git status` to see a list of changes between the index and <<HEAD,HEAD>>.
+But you can use `git ls-files --stage` to see the index.
+Here's the output of `git ls-files --stage` in a repository with 2 files:
+
+----
+100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0 README.md
+100644 665c637a360874ce43bf74018768a96d2d4d219a 0 src/hello.py
+----
+
+[[reflogs]]
+REFLOGS
+-------
+
+Git stores the history of branch, tag, and HEAD refs in a reflog
+(you should read "reflog" as "ref log"). Not every ref is logged by
+default, but any ref can be logged.
If we mention this here, do we maybe want to mention how the user can
decide which references are logged?
Do you mean by using the setting `core.logAllRefUpdates`?
quoted
+Each reflog entry has:
+
+1. *Before/after *commit IDs*
This will probably misformat as we have three asterisks here, not two.
quoted
+2. *User* who made the change, for example `Maya [off-list ref]`
+3. *Timestamp*
Suggestion: "*Timestamp* when that change has been made".
Makes sense.
quoted
+4. *Log message*, for example `pull: Fast-forward`
+
+Reflogs only log changes made in your local repository.
+They are not shared with remotes.
We may want to mention that you can reference reflog entries via
`refs/heads/<branch>@{<reflog-nr>}`.
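
As a sketch of how such a lookup relates to the reflog fields listed in the patch, the Python below models a reflog as a chronological list of entries and reads `@{n}` as "the value the ref had n updates ago". The class `ReflogEntry`, the function `at`, the shortened object IDs, the email address, and the timestamps are all hypothetical; this is a model of the idea, not of Git's storage format.

```python
from dataclasses import dataclass

@dataclass
class ReflogEntry:
    old_id: str      # value of the ref before the update
    new_id: str      # value of the ref after the update
    user: str
    timestamp: int
    message: str

def at(reflog: list, n: int) -> str:
    """Value of the ref n updates ago, in the spirit of '<ref>@{n}'."""
    if n < 0 or n >= len(reflog):
        raise IndexError(f"reflog has no entry {n}")
    return reflog[-1 - n].new_id

# Oldest entry first; IDs are shortened placeholders.
reflog = [
    ReflogEntry("0" * 7, "aaaaaaa", "Maya <maya@example.com>", 1759173425, "commit (initial): Add README"),
    ReflogEntry("aaaaaaa", "bbbbbbb", "Maya <maya@example.com>", 1759173500, "commit: Add hello.py"),
    ReflogEntry("bbbbbbb", "ccccccc", "Maya <maya@example.com>", 1759173600, "pull: Fast-forward"),
]
print(at(reflog, 0))  # current value of the ref
print(at(reflog, 1))  # value before the most recent update
```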

In general, one thing that I think would be important to highlight in
this document is revisions. Most of the commands tend to not accept
references, but revisions instead, which are a lot more flexible. They
use our do-what-I-mean mechanism to resolve, but also allow the user to
specify commits relative to one another. It's probably sufficient though
to mention them briefly and then redirect to gitrevisions(7).
Will think about this, I'm not sure how to best incorporate that.
Maybe under the commits section.
Thanks for working on this!
Thanks for the review!

- Julia