Re: [PATCH v4] doc: add an explanation of Git's data model

From: Julia Evans <hidden>
Date: 2025-10-28 20:11:21

quoted

+
+It's not necessary to understand Git's data model to use Git, but it's
+very helpful when reading Git's documentation so that you know what it
+means when the documentation says "object", "reference" or "index".

"While it is not necessary ..., it is helpful ..." may flow better
than "It is not necessary ..., but it is very helpful".

quoted

+This means that if you have an object's ID, you can always recover its
+exact contents as long as the object hasn't been deleted.

Somewhere, perhaps in a distant footnote, we may want to mention that
objects that are in use are never deleted, and say when unused ones do
get removed (i.e., by garbage collection).  As part of the data model,
"everything is retained by default, until we can prove it is no longer
reachable" probably belongs somewhere.

Agreed, I really like this idea. Came up with the following, which I'll put at
the bottom of the "References" section if I don't come up with a better idea.
(I don't feel strongly about where exactly it should go):

NOTE: Objects will only be deleted if they aren't "reachable" from any reference.
An object is "reachable" if we can find it by following tags to whatever
they tag, commits to their parents or trees, and trees to the trees or
blobs that they contain.
For example, if you amend a commit, with `git commit --amend`,
the old commit will usually not be reachable, so it may be deleted eventually.
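
To make the "reachable" definition in the NOTE concrete, here is a minimal
sketch of the walk over a toy object store. The object IDs, the `OBJECTS`
table and the `reachable` helper are all made up for illustration; this is
not how Git implements it, and real Git also treats things other than
references (reflogs, for example) as starting points.

```python
from collections import deque

# Hypothetical toy object store: ID -> (type, IDs this object points to).
# Tags point to what they tag, commits to parents and a tree,
# trees to the trees and blobs they contain.
OBJECTS = {
    "tag1": ("tag", ["c2"]),
    "c2": ("commit", ["c1", "t2"]),
    "c1": ("commit", ["t1"]),
    "c2_old": ("commit", ["c1", "t_old"]),  # the commit replaced by --amend
    "t1": ("tree", ["b1"]),
    "t2": ("tree", ["b1", "b2"]),
    "t_old": ("tree", ["b_old"]),
    "b1": ("blob", []),
    "b2": ("blob", []),
    "b_old": ("blob", []),
}

# Hypothetical references and the object each one points to.
REFS = {"refs/heads/main": "c2", "refs/tags/v1": "tag1"}


def reachable(objects, start_ids):
    """Return the set of object IDs found by following links from start_ids."""
    seen = set()
    queue = deque(start_ids)
    while queue:
        oid = queue.popleft()
        if oid in seen or oid not in objects:
            continue
        seen.add(oid)
        queue.extend(objects[oid][1])
    return seen


live = reachable(OBJECTS, REFS.values())
unreachable = sorted(set(OBJECTS) - live)
print(unreachable)
```

In this toy store the amended commit `c2_old` and the tree and blob that only
it used are the objects left over, which matches the "may be deleted
eventually" wording.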

quoted

+Here's how each type of object is structured:
+
+[[commit]]
+commit::
+    A commit contains the full directory structure of every file
+    in that version of the repository and each file's contents.

What you are describing here is more of the property of a tree; a
commit is a bit richer.

    A commit records a snapshot of every file in the project at
    one point in time, records who contributed to create such a
    snapshot and why, and how that particular snapshot relates to
    other snapshots in the history.

I don't understand the goal of explaining a commit in detail in
paragraph form when we already explain everything in a commit right
below this.

My goal with this intro sentence is just to emphasize what I think is the
least obvious point in that list, which is that commits contain every file.

Happy to change it to something shorter like
"A commit records a snapshot of every file in the project" if you
prefer that wording.

quoted

+    It has these these required fields

"these these".

Oops, will fix

quoted

+Like all other objects, commits can never be changed after they're created.
+For example, "amending" a commit with `git commit --amend` creates a new
+commit with the same parent.

"same parent." -> "same parent, without modifying the original
commit object at all"?  Maybe redundant?  I dunno.

quoted

+[[tree]]
+tree::
+    A tree is how Git represents a directory.

"a directory" -> "contents in a directory"?  I dunno.

quoted

+    It can contain files or other trees (which are subdirectories).
+    It lists, for each item in the tree:
++
+1. The *filename*, for example `hello.py`
+2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory),
+  or <<commit,`commit`>> (a Git submodule, which is a
+  commit from a different Git repository)

This is a bit of white lie.  A tree object entry never stores the
type of the object.  It records <mode, object name, path component>.

The second field you see in git ls-tree output is computed from the
object name (when the object is available) or inferred from the mode
bits.
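
As a rough illustration of "inferred from the mode bits", here is a small
sketch that maps a tree entry's mode to the type column. The function name
`entry_type` is hypothetical, and the sketch covers only the mode-based
inference, not the lookup by object name.

```python
def entry_type(mode: str) -> str:
    """Infer the object type of a tree entry from its octal mode string."""
    value = int(mode, 8)
    if value == 0o040000:
        return "tree"      # a subdirectory
    if value == 0o160000:
        return "commit"    # a submodule
    if value in (0o100644, 0o100755, 0o120000):
        return "blob"      # regular file, executable file, or symbolic link
    raise ValueError(f"unexpected mode: {mode}")


for mode in ("100644", "100755", "120000", "040000", "160000"):
    print(mode, entry_type(mode))
```

Note that three different modes all map to "blob", which is why the mode
carries information that the object type alone does not.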

Thanks, I didn't realize how tree object entries were stored.
Will remove "type".

quoted

+3. The *file mode*. Git has these file modes. which are only
+   spiritually related to Unix permissions:

In the cover letter part of the message I am responding to, I saw
repeated mention that "permissions" should be "file mode"; let's be
consistent.

"Git has these file modes, which are ..." ->

Makes sense. Will change to "Unix file modes" from "Unix permissions".
I don't think this needs a more dramatic rewrite though.

    Git uses the following file modes to represent what each tree
    entry is (because an object of the same type, e.g. "blob", is
    used to represent more than one kind of thing).  The file modes
    are assigned to resemble Unix file modes.

    Note that Git does not _store_ permissions, and there are only
    two kinds of regular files: non-executable (100644) or
    executable (100755).  To Git, there are no files that are
    "readable only by the owner" etc., so file mode bits like
    100600, 100400, etc., are never used.
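
To illustrate the point about permissions not being stored, here is a small
sketch of how a filesystem mode could be reduced to one of the modes listed
above. The helper name `git_file_mode` is hypothetical, and the sketch
assumes that only the owner-executable bit decides between 100644 and
100755.

```python
import stat


def git_file_mode(st_mode: int) -> int:
    """Reduce a filesystem st_mode to the mode a tree entry would record."""
    if stat.S_ISLNK(st_mode):
        return 0o120000
    if stat.S_ISREG(st_mode):
        # Assumption: only the owner-executable bit matters; every other
        # permission bit is dropped.
        return 0o100755 if st_mode & stat.S_IXUSR else 0o100644
    raise ValueError(f"not a regular file or symlink: {oct(st_mode)}")


for fs_mode in (0o100600, 0o100400, 0o100700, 0o100664):
    print(oct(fs_mode), "->", oct(git_file_mode(fs_mode)))
```

Under that assumption, a file that is "readable only by the owner" (100600)
and a world-readable one (100644) end up recorded identically.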

quoted

+[[tag-object]]
+tag object::
+    Tag objects contain these required fields
+    (though there are other optional fields):
++
+1. The *ID* and *type* of the object (often a commit) that they reference

Not wrong per se, but it is a bit curious to lump these two into a
single enumerated item here, when "author" and "committer" were
enumerated separately for commit objects.  If you are going to show
"cat-file -p" output for illustration, it may help readers
understand them if you list them separately here.

Agreed, I'll split them into two items.

quoted

+2. The *tagger* and tag date
+3. A *tag message*, similar to a commit message

quoted

+[[index]]
+THE INDEX
+---------
+The index, also known as the "staging area", is a list of files and
+the contents of each file, stored as a <<blob,blob>>.
+You can add files to the index or update the contents of a file in the
+index with linkgit:git-add[1]. This is called "staging" the file for commit.
+
+Unlike a <<tree,tree>>, the index is a flat list of files.

This is a bit of a white lie, as modern versions of Git could be
collapsing uninteresting parts of the directory structure into a
single tree in an index entry (this is called "sparse index"), and
can expand such a collapsed "tree" in the index on demand into its
constituent files and directories.  But I do not mind presenting the
traditional world model for conceptual simplicity.

I didn't know that, thanks. I guess I'll leave it the way it is for now.
It could be good to add a footnote, but I don't actually know how
to add footnotes in this document format.

quoted

+When you commit, Git converts the list of files in the index to a
+directory <<tree,tree>> and uses that tree in the new <<commit,commit>>.
+
+Each index entry has 4 fields:
+
+1. The *<<tree,file mode>>*
+2. The *<<blob,blob>> ID* of the file

If you were to collapse descriptions like you did for tag objects
where ID and TYPE were treated as a unit, here is the place to do
so.  With the mode bits and object ID, we can represent regular
files that are non-executable, regular files that are executable,  
symbolic links, and submodules (if a sparse-index is in use, an
index entry could be a subdirectory, but I suggested above that we
can ignore them for simplicity).

But <<blob,blob>> is highly misleading.  Even if we ignore
sparse-index, we may see a commit object there.

Thanks, I didn't realize that. Will change to say that it can be a blob
or commit ID. I don't think that collapsing will help, IMO it's
important to keep a consistent format.

    Each index entry records

    1. The object that occupies the path, as (file mode, object
       name) tuple.  Most often, it is a regular file whose contents
       are stored in a blob object, that is either non-executable
       (100644), executable (100755), or a symbolic link (120000),
       but the object can be a commit in another repository if it
       represents a submodule.

    2. The stage number, which is normally 0, but entries with
       higher stages for the same path are used during a conflicted
       merge.

    3. The path name for the index entry.

quoted

+3. The *file path*, for example `src/hello.py`
+4. The *stage number*, either 0, 1, 2, or 3. This is normally 0, but if
+   there's a merge conflict there can be multiple versions of the same
+   filename in the index.

If you are going by "ls-files -s" output, it may be better to swap 3
and 4 above for ease of understanding.

Good point, will do.
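
For reference, the field order in `git ls-files --stage` output (mode,
object ID, stage number, path) can be sketched as a small parser. The
`IndexEntry` and `parse_stage_line` names are hypothetical. The sketch
assumes the path is separated from the stage number by a tab, and falls back
to whitespace splitting for copies of the output where the tab has been
flattened to a space.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    mode: str
    object_id: str
    stage: int
    path: str


def parse_stage_line(line: str) -> IndexEntry:
    """Parse one line of `git ls-files --stage` output."""
    meta, sep, path = line.partition("\t")
    if sep:
        mode, object_id, stage = meta.split()
    else:
        # Fallback for text where the tab was converted to a space.
        mode, object_id, stage, path = line.split(None, 3)
    return IndexEntry(mode, object_id, int(stage), path)


sample = "100644 665c637a360874ce43bf74018768a96d2d4d219a 0\tsrc/hello.py"
entry = parse_stage_line(sample)
print(entry.stage, entry.path)
```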

quoted

+It's extremely uncommon to look at the index directly: normally you'd
+run `git status` to see a list of changes between the index and <<HEAD,HEAD>>.
+But you can use `git ls-files --stage` to see the index.
+Here's the output of `git ls-files --stage` in a repository with 2 files:
+
+----
+100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0 README.md
+100644 665c637a360874ce43bf74018768a96d2d4d219a 0 src/hello.py
+----
+
+[[reflogs]]
+REFLOGS
+-------
+
+Every time a branch, remote-tracking branch, or HEAD is updated, Git
+updates a log called a "reflog" for that <<references,reference>>.

If we want to avoid using the word X while explaining X, then we can
rephrase it as "Git updates a record in the reflog for that
reference".

I think the current phrasing is okay. I also didn't respond to some of the
phrasing suggestions above if I didn't understand the goal of them.
Hope that's okay.
