Re: [PATCH] doc: add a explanation of Git's data model
From: Julia Evans <hidden>
Date: 2025-10-07 18:55:58
On Tue, Oct 7, 2025, at 10:32 AM, Patrick Steinhardt wrote:
On Fri, Oct 03, 2025 at 05:34:36PM +0000, Julia Evans via GitGitGadget wrote:
diff --git a/Documentation/gitdatamodel.adoc b/Documentation/gitdatamodel.adoc
new file mode 100644
index 0000000000..4b2cb167dc
--- /dev/null
+++ b/Documentation/gitdatamodel.adoc
@@ -0,0 +1,226 @@
+gitdatamodel(7)
+===============
+
+NAME
+----
+gitdatamodel - Git's core data model
+
+DESCRIPTION
+-----------
+
+It's not necessary to understand Git's data model to use Git, but it's
+very helpful when reading Git's documentation so that you know what it
+means when the documentation says "object" "reference" or "index".

There's a missing comma after "object".
Will fix.
+
+Git's core operations use 4 kinds of data:
+
+1. <<objects,Objects>>: commits, trees, blobs, and tag objects
+2. <<references,References>>: branches, tags,
+ remote-tracking branches, etc
+3. <<index,The index>>, also known as the staging area
+4. <<reflogs,Reflogs>>

This list makes sense to me. There's of course more data structures in Git, but all the other data structures shouldn't really matter to users at all as they are mostly caches or internal details of the on-disk format. There's potentially one exception though, namely the Git configuration. I'd claim that Git "uses" the Git configuration similarly to how it uses the others, but I get why it's not explicitly mentioned here.

+[[objects]]
+OBJECTS
+-------
+
+Commits, trees, blobs, and tag objects are all stored in Git's object database.
+Every object has:
+
+1. an *ID*, which is the SHA-1 hash of its contents.

I think this needs to be adapted to not single out SHA-1 as the only hashing algorithm. We already support SHA-256, so we should definitely say that the algorithm can be swapped. Maybe something like:

  An *object ID*, which is the cryptographic hash of its contents. By default, Git uses SHA-1 as object hash, but alternative hashes like SHA-256 are supported.
Makes sense. I might just say "cryptographic hash of its type and contents" and leave it at that. I'm not sure it's worth getting into the details of the exact hash function.
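For reference, here is a minimal sketch of what "hash of its type and contents" means in practice. It assumes the loose-object convention of hashing a `<type> <size>\0` header followed by the contents; `git_object_id` is a hypothetical helper name used for illustration, not a Git API.

```python
import hashlib


def git_object_id(obj_type: str, contents: bytes, algo: str = "sha1") -> str:
    """Hash "<type> <size>\\0" + contents, the way Git derives an object ID.

    git_object_id is a hypothetical helper for illustration, not a Git API.
    """
    # The header records the object type and the size of the contents in bytes.
    header = f"{obj_type} {len(contents)}\0".encode("ascii")
    return hashlib.new(algo, header + contents).hexdigest()


# The empty blob has a well-known SHA-1 object ID.
print(git_object_id("blob", b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Passing `algo="sha256"` gives a 64-character ID instead, which is one reason not to name a specific hash function in the text.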
+ It's fast to look up a Git object using its ID.
+ The ID is usually represented in hexadecimal, like
+ `1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`.
+2. a *type*. There are 4 types of objects:
+ <<commit,commits>>, <<tree,trees>>, <<blob,blobs>>,
+ and <<tag-object,tag objects>>.
+3. *contents*. The structure of the contents depends on the type.

Nit: every object also has an object size. Not sure though whether it's fine to imply that with "contents".
I think it is.
+Once an object is created, it can never be changed.
+Here are the 4 types of objects:
+
+[[commit]]
+commits::
+ A commit contains:
++
+1. Its *parent commit ID(s)*. The first commit in a repository has 0 parents,
+ regular commits have 1 parent, merge commits have 2+ parents

I'd say "at least two parents" instead of "2+ parents".

+2. A *commit message*
+3. All the *files* in the commit, stored as a *<<tree,tree>>*
+4. An *author* and the time the commit was authored
+5. A *committer* and the time the commit was committed
++
+Here's how an example commit is stored:
++
+----
+tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a
+parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647
+author Maya [off-list ref] 1759173425 -0400
+committer Maya [off-list ref] 1759173425 -0400
+
+Add README
+----

In practice, commits can have other headers that are ignored by Git. But that's certainly not part of Git's core data model, so I don't think we should mention that here.

+Like all other objects, commits can never be changed after they're created.
+For example, "amending" a commit with `git commit --amend` creates a new commit.
+The old commit will eventually be deleted by `git gc`.

If we mention git-gc(1) I think it would make sense to use `linkgit:git-gc[1]` instead to provide a link to its man page.
Agreed.
+[[tree]]
+trees::
+ A tree is how Git represents a directory. It lists, for each item in
+ the tree:
++
+1. The *permissions*, for example `100644`

I think we should rather call these "mode bits". These bits are permissions indeed when you have a blob, but for subtrees, symlinks and submodules they aren't.
I think it's a bit strange to call them mode bits, since I thought they were stored as ASCII strings and it's basically an enum of 5 options, but I see your point. I think "file mode" will work, and that's used elsewhere.

I wonder if it would make sense to list all of the possible file modes if this isn't documented anywhere else. My impression is that it's a short list and that it's unlikely to change much in the future. Listing them all might also make it clearer that Git's file modes don't have much in common with Unix file modes.

I looked for where this is documented and it looks like the only place is in `man git-fast-import`. That man page says that there are just 5 options (040000, 160000, 100644, 100755, 120000).
+2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory),
+ or <<commit,`commit`>> (a Git submodule)

There's also symlinks.
I created a test symlink and it looks like symlinks are stored as type "blob". I might say which type corresponds to which file mode, though I'm not sure what type corresponds to the "gitlink" mode (commit?).

I think these are the 5 modes and what they mean / what type they should have. Not sure about the gitlink mode though.

- `100644`: regular file (with type `blob`)
- `100755`: executable file (with type `blob`)
- `120000`: symbolic link (with type `blob`)
- `040000`: directory (with type `tree`)
- `160000`: gitlink, for use with submodules (with type `commit`)
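The mode list above can be sketched as a small lookup table. This is only an illustration: `MODE_TO_TYPE` and `parse_tree_line` are hypothetical names, the modes are written the way `git ls-tree` prints them, and the `commit` type for gitlinks is the pairing I was unsure about (I believe it matches what `git ls-tree` shows for a submodule).

```python
# Modes as printed by `git ls-tree`, paired with the expected object type.
# MODE_TO_TYPE and parse_tree_line are hypothetical names for illustration.
MODE_TO_TYPE = {
    "100644": ("blob", "regular file"),
    "100755": ("blob", "executable file"),
    "120000": ("blob", "symbolic link"),
    "040000": ("tree", "directory"),
    "160000": ("commit", "gitlink (submodule)"),  # assumed pairing, see above
}


def parse_tree_line(line: str) -> dict:
    """Parse one "<mode> <type> <object ID> <filename>" tree listing line."""
    mode, obj_type, object_id, filename = line.split(None, 3)
    expected_type, meaning = MODE_TO_TYPE[mode]
    if obj_type != expected_type:
        raise ValueError(
            f"mode {mode} should have type {expected_type}, got {obj_type}"
        )
    return {
        "mode": mode,
        "type": obj_type,
        "object_id": object_id,
        "filename": filename,
        "meaning": meaning,
    }


entry = parse_tree_line("040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src")
print(entry["meaning"])  # directory
```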
+3. The *object ID*
+4. The *filename*
++
+For example, this is how a tree containing one directory (`src`) and one file
+(`README.md`) is stored:
++
+----
+100644 blob 8728a858d9d21a8c78488c8b4e70e531b659141f README.md
+040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src
+----
++
+*NOTE:* The permissions are in the same format as UNIX permissions, but
+the only allowed permissions for files (blobs) are 644 and 755.
+
+[[blob]]
+blobs::
+ A blob is how Git represents a file. A blob object contains the
+ file's contents.
++
+Storing a new blob for every new version of a file can get big, so
+`git gc` periodically compresses objects for efficiency in `.git/objects/pack`.

I would claim that it's not necessary to mention object compression. This should be a low-level detail that users don't ever have to worry about. Furthermore, packing objects isn't only relevant in the context of blobs: trees for example also tend to compress very well as there typically is only small incremental updates to trees.
I discussed why I think this is important in another reply, https://lore.kernel.org/all/51e0a55c-1f1d-4cae-9459-8c2b9220e52d@app.fastmail.com/, and will paste what I said here. I'll think about this more though.

Paste follows:

That's true! The reason I think this is important to mention is that I find that people often "reject" information that they find implausible, even if it comes from a credible source. ("that can't be true! I must not be understanding correctly. Oh well, I'll just ignore that!")

I sometimes hear from users that "commits can't be snapshots", because it would take up too much disk space to store every version of every commit. So I find that sometimes explaining a little bit about the implementation can make the information more memorable.

Certainly I'm not able to remember details that don't make sense with my mental model of how computers work and I don't expect other people to either, so I think it's important to give an explanation that handles the biggest "objections".
+[[tag-object]]
+tag objects::
+ Tag objects (also known as "annotated tags") contain:
++
+1. The *tagger* and tag date
+2. A *tag message*, similar to a commit message
+3. The *ID* of the object (often a commit) that they reference

They can also be signed, if we want to mention that.
I guess that's true for commit objects too. Not sure whether to mention it either, can add it if others think it's important.
+[[references]]
+REFERENCES
+----------
+
+References are a way to give a name to a commit.
+It's easier to remember "the changes I'm working on are on the `turtle`
+branch" than "the changes are in commit bb69721404348e".
+Git often uses "ref" as shorthand for "reference".
+
+References that you create are stored in the `.git/refs` directory,
+and Git has a few special internal references like `HEAD` that are stored
+in the base `.git` directory.

This isn't true anymore with the introduction of the reftable backend, which is slated to become the default backend. I'd argue that this is another implementation detail that the user shouldn't have to worry about.
Makes sense, will fix (as well as other references to the `.git` prefix and "subdirectories").
+References can either be:
+
+1. References to an object ID, usually a <<commit,commit>> ID
+2. References to another reference. This is called a "symbolic reference".
+
+Git handles references differently based on which subdirectory of
+`.git/refs` they're stored in.

So instead of saying "subdirectory", I'd rather say "reference hierarchy". In general, I think we should explain that references are laid out in a hierarchy. This is somewhat obvious with the "files" backend, as we use directories there. But as we move on to the "reftable" backend this may become less obvious over time.
That makes sense.
+[[tag]]
+tags: `.git/refs/tags/<name>`::
+ A tag is a name for a commit ID, tag object ID, or other object ID.
+ Tags are stored in the `refs/tags/` directory.
++
+Even though branches and commits are both "a name for a commit ID", Git
+treats them very differently.
+Branches are expected to be regularly updated as you work on the branch,
+but it's expected that a tag will never change after you create it.

This sounds a bit like the user itself needs to update the branch. How about this instead:

  Even though branches and commits are both "a name for a commit ID", Git treats them very differently:

  - Branches can be checked out directly. If so, creating a new commit will automatically update the checked-out branch to point to the new commit.

  - Tags cannot be checked out directly and don't move when creating a new commit. Instead, one can only check out the commit that a branch points to. This is called "detached HEAD", and the effect is that a new commit will not update
I think mentioning that branches can be checked out and that tags can't is a good idea.
+[[HEAD]]
+HEAD: `.git/HEAD`::
+ `HEAD` is where Git stores your current <<branch,branch>>.
+ `HEAD` is normally a symbolic reference to your current branch, for
+ example `ref: refs/heads/main` if your current branch is `main`.
+ `HEAD` can also be a direct reference to a commit ID,
+ that's called "detached HEAD state".
+
+[[remote-tracking-branch]]
+remote tracking branches: `.git/refs/remotes/<remote>/<branch>`::
+ A remote-tracking branch is a name for a commit ID.
+ It's how Git stores the last-known state of a branch in a remote
+ repository. `git fetch` updates remote-tracking branches. When
+ `git status` says "you're up to date with origin/main", it's looking at
+ this.

This misses "refs/remotes/<remote>/HEAD". This reference is a symbolic reference that indicates the default branch on the remote side.
Is "refs/remotes/<remote>/HEAD" a remote-tracking branch? I've never thought about that reference and I'm not sure what to call it.
+[[other-refs]]
+Other references::
+ Git tools may create references in any subdirectory of `.git/refs`.
+ For example, linkgit:git-stash[1], linkgit:git-bisect[1],
+ and linkgit:git-notes[1] all create their own references
+ in `.git/refs/stash`, `.git/refs/bisect`, etc.
+ Third-party Git tools may also create their own references.
++
+Git may also create references in the base `.git` directory
+other than `HEAD`, like `ORIG_HEAD`.

Let's mention that such references are typically spelt all-uppercase with underscores between. You shouldn't ever create a reference that is for example called ".git/foo". We enforce this restriction inconsistently, only, but I don't think that should keep us from spelling out the common rule.
That makes sense. I'm also not sure whether third-party Git tools are "supposed" to create references outside of "refs/", or whether that's common.
+*NOTE:* As an optimization, references may be stored as packed
+refs instead of in `.git/refs`. See linkgit:git-pack-refs[1].

I'd drop this note. It's an internal implementation detail and only true for the "files" backend. The "reftable" backend stores references quite differently and doesn't really "pack" references.

+[[index]]
+THE INDEX
+---------
+
+The index, also known as the "staging area", contains the current staged

Honestly, I always forget which of these two nouns we are supposed to use nowadays. I think consensus was to use "index" and avoid using "staging area"? Not sure though, but I think we should only mention one of these.

+version of every file in your Git repository. When you commit, the files
+in the index are used as the files in the next commit.
+
+Unlike a tree, the index is a flat list of files.
+Each index entry has 4 fields:
+
+1. The *permissions*
+2. The *<<blob,blob>> ID* of the file
+3. The *filename*
+4. The *number*. This is normally 0, but if there's a merge conflict

I think we don't call this "number", but "stage".
Thanks, I see that it's sometimes called "stage number", which is a little easier to search for, so I'll call it that.
+ there can be multiple versions (with numbers 0, 1, 2, ..)
+ of the same filename in the index.
+
+It's extremely uncommon to look at the index directly: normally you'd
+run `git status` to see a list of changes between the index and <<HEAD,HEAD>>.
+But you can use `git ls-files --stage` to see the index.
+Here's the output of `git ls-files --stage` in a repository with 2 files:
+
+----
+100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0 README.md
+100644 665c637a360874ce43bf74018768a96d2d4d219a 0 src/hello.py
+----
+
+[[reflogs]]
+REFLOGS
+-------
+
+Git stores the history of branch, tag, and HEAD refs in a reflog
+(you should read "reflog" as "ref log"). Not every ref is logged by
+default, but any ref can be logged.

If we mention this here, do we maybe want to mention how the user can decide which references are logged?
Do you mean by using the setting `core.logAllRefUpdates`?
+Each reflog entry has:
+
+1. *Before/after *commit IDs*

This will probably misformat as we have three asterisks here, not two.

+2. *User* who made the change, for example `Maya [off-list ref]`
+3. *Timestamp*

Suggestion: "*Timestamp* when that change has been made".
Makes sense.
+4. *Log message*, for example `pull: Fast-forward`
+
+Reflogs only log changes made in your local repository.
+They are not shared with remotes.

We may want to mention that you can reference reflog entries via `refs/heads/<branch>@{<reflog-nr>}`.

In general, one thing that I think would be important to highlight in this document is revisions. Most of the commands tend to not accept references, but revisions instead, which are a lot more flexible. They use our do-what-I-mean mechanism to resolve, but also allow the user to specify commits relative to one another. It's probably sufficient though to mention them briefly and then redirect to gitrevisions(7).
Will think about this, I'm not sure how to best incorporate that. Maybe under the commits section.
Thanks for working on this!
Thanks for the review!

- Julia