Re: [PATCH v13 04/13] reftable: file format documentation
From: Jonathan Nieder <hidden>
Date: 2020-05-20 18:52:07
Junio C Hamano wrote:
Han-wen wrote:
quoted
From: Jonathan Nieder <redacted>

Shawn Pearce explains:

Some repositories contain a lot of references (e.g. android at 866k,
rails at 31k). The reftable format provides:

- Near constant time lookup for any single reference, even when the
  repository is cold and not in process or kernel cache.
- Near constant time verification a SHA-1 is referred to by at least
  one reference (for allow-tip-sha1-in-want).

Not quite grammatical sentence? Perhaps "if" after "verification"?
Good catch, thanks. [...]
quoted
using pandoc 2.2.1. The result required the following additional
minor changes:

- removed the [TOC] directive to add a table of contents, since
  asciidoc does not support it
- replaced git-scm.com/docs links with linkgit: directives that link
  to other pages within Git's documentation

There are many ’ funny-quotes where we would prefer to place vanilla single quotes, which may also need to be corrected in the conversion toolchain.
Looks like Han-Wen is taking care of this (thanks!).
Typos pointed out below are probably from the original, where they should be corrected.
I'm happy to do one final update to the doc in JGit to match what we end up with and then replace it with a pointer to Git's copy once that lands. [...]
quoted
+Repositories with many loose references occupy a large number of disk
+blocks from the local file system, as each reference is its own file
+storing 41 bytes (and another file for the corresponding reflog). This
+negatively affects the number of inodes available when a large number of
+repositories are stored on the same filesystem. Readers can be penalized
+due to the larger number of syscalls required to traverse and read the
+`$GIT_DIR/refs` directory.

Another downside is that we cannot arrange atomic updates to multiple refs over loose refs, even though the "lookup of a single reference does not require linear scan" unlike packed-refs, (as long as the filesystem does its job). Worth mentioning?
Yes, this was another major part of the motivation (avoiding the complication of the "atomic" multi-ref updates to packed-refs that Git and JGit had to learn). [...]
quoted
+References stored in a reftable are peeled, a record for an annotated
+(or signed) tag records both the tag object, and the object it refers
+to.

OK. Peeled results are recorded in packed-refs file because quite often when we use a tag object, what we actually want to access is the commit object it points at. We do so here for the same reason? Not a rhetorical question, but if it invites a question from a reader, it may deserve to be described before readers ask it.
For a single tag ref, peeling to a commit is not very expensive. But for batch lookups e.g. when serving a response to an ls-remote request, it adds up, and having the peeled results recorded helps. [...]
quoted
+Directory/file conflicts
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+The reftable format accepts both `refs/heads/foo` and
+`refs/heads/foo/bar` as distinct references.
+
+This property is useful for retaining log records in reftable, but may
+confuse versions of Git using `$GIT_DIR/refs` directory tree to maintain
+references. Users of reftable may choose to continue to reject `foo` and
+`foo/bar` type conflicts to prevent problems for peers.
[...]
"users ... may choose" implies that it is up to the implementation of reftable user which one to show, so given a single repository, "jgit" may show "refs/heads/foo" while "libgit2" may choose to show the other one. I am not sure if that is desirable---I suspect that we want to record which one needs to be chosen so that these "D/F conflicts disallowing" users can make consistent choices, but I dunno.
Yes, I think it would be better to explicitly say that Git will continue to reject D/F conflicts for refs (*not* reflogs) even though the format can support them in principle. If we choose to permit them some day in the future, I believe that would be a separate repository format extension and protocol capability to avoid confusing old versions of Git. [...]
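To make the D/F rejection discussed above concrete, here is a minimal sketch in Python. It is an illustration only, not code from Git or JGit: the function name `find_df_conflict` and the plain set of ref names are assumptions made for this example.

```python
def find_df_conflict(existing_refs, new_ref):
    """Return an existing ref name that D/F-conflicts with new_ref, or None.

    Hypothetical helper for illustration; not part of Git or JGit.
    """
    existing = set(existing_refs)

    # Case 1: a proper prefix of new_ref already exists as a ref, e.g.
    # existing "refs/heads/foo" versus new "refs/heads/foo/bar".
    parts = new_ref.split("/")
    for i in range(1, len(parts)):
        prefix = "/".join(parts[:i])
        if prefix in existing:
            return prefix

    # Case 2: new_ref would act as a directory for an existing ref, e.g.
    # existing "refs/heads/foo/bar" versus new "refs/heads/foo".
    dir_prefix = new_ref + "/"
    for name in sorted(existing):
        if name.startswith(dir_prefix):
            return name

    return None
```

A writer that keeps rejecting such conflicts would run a check of this kind before adding a ref record. Under the proposal above it would apply to refs only, not to reflog records.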
quoted
+Symbolic references use `0x3`, followed by the complete name of the
+reference target. No compression is applied to the target name.

Is there a place in the file format where an incomplete name can be stored? If not, I think it makes it easier to read if we drop "complete" from the sentence.
The sentence about "no compression" covers the lack of prefix encoding, so I suppose I agree. Might make sense to say "full name" to convey that we're talking about rev-parse --symbolic-full-name, not a relative path like symlinks support. [...]
quoted
+Log block format
+^^^^^^^^^^^^^^^^
+
+Unlike ref and obj blocks, log blocks are always unaligned.
+
+Log blocks are variable in size, and do not match the `block_size`
+specified in the file header or footer. Writers should choose an
+appropriate buffer size to prepare a log block for deflation, such as
+`2 * block_size`.

I can guess the reason behind this design decision, but the readers may not be able to. Should we write it down here, or would that add too much irrelevant detail?
I don't have a strong opinion. It sounds like Han-Wen sees something to explain there, so I suppose it would be nice to spell out. (My take: reflog lookups are not on the critical path for most operations; especially, random accesses do not need to be fast. From a performance perspective, the best we can do is to compress them well to decrease I/O cost, hence there's not much value to alignment.) [...]
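A rough sketch of why deflated log blocks end up variable in size, assuming records are simply concatenated into a buffer of at most `2 * block_size` bytes and then compressed with zlib. The function name `deflate_log_blocks` and the record contents are made up for illustration. The sketch leaves out the block header and the real log record encoding, so it is not the reftable writer.

```python
import zlib


def deflate_log_blocks(records, block_size):
    """Group raw records into buffers of at most 2 * block_size bytes and
    deflate each buffer separately (illustration only)."""
    limit = 2 * block_size  # buffer size suggested in the quoted text
    blocks = []
    buf = bytearray()
    for rec in records:
        # Flush before the buffer would exceed the limit.
        if buf and len(buf) + len(rec) > limit:
            blocks.append(zlib.compress(bytes(buf)))
            buf.clear()
        buf += rec
    if buf:
        blocks.append(zlib.compress(bytes(buf)))
    return blocks
```

The compressed size of each block depends on how well its contents deflate, so the blocks cannot be expected to line up with `block_size` boundaries.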
This is a tangent, but in a repository at a hosting provider, whose primary (and often the only) source of updates is end users pushing into it, if reflogs are enabled, whose name and email are recorded in the logs? The committer or tagger of the object that sits at the tip of the ref after the update? What happens when a blob is pushed to update a ref? Or would it be just a single "user" that represents the "server operator"?
The latter, "server operator" (GIT_COMMITTER_IDENT at the server). Committer in commit objects is forgeable, hence wouldn't be very useful here.
We know that a non-bare repository an individual contributor works on typically records only one <name, email> in the reflog: the user who works in it. What I am trying to get at is if it makes more sense to have a small table of unique <name, email> pairs used in the file and have log_data record a single varint that is the index into that "committer ident" table. I would suspect that it would give us significantly more gain than mere <> two bytes per log_data entry.
That's true, and a good idea for the next rev of the format. [...]
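The ident table idea can be sketched as follows. This is a hypothetical layout, not part of the reftable format under review: `build_ident_table` is an invented name, and `encode_varint` is a generic LEB128-style encoding used for illustration, which is not necessarily the varint encoding reftable itself uses.

```python
def encode_varint(n):
    """LEB128-style varint, least significant 7 bits first (illustration only)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def build_ident_table(log_entries):
    """Replace each (name, email) pair with a varint index into a table
    of unique idents. log_entries is a list of (name, email, message)."""
    table = []      # unique (name, email) pairs, in first-seen order
    index_of = {}   # ident -> position in table
    encoded = []    # (varint index, message) per log entry
    for name, email, message in log_entries:
        ident = (name, email)
        if ident not in index_of:
            index_of[ident] = len(table)
            table.append(ident)
        encoded.append((encode_varint(index_of[ident]), message))
    return table, encoded
```

In a repository where every entry carries the same server operator ident, the table holds one pair and each log entry spends a single byte on it.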
quoted
+A 68-byte footer appears at the end:
+
+....
+    'REFT'
+    uint8( version_number = 1 )
+    uint24( block_size )
+    uint64( min_update_index )
+    uint64( max_update_index )
+
+    uint64( ref_index_position )
+    uint64( (obj_position << 5) | obj_id_len )
[...]
quoted
+* `obj_id_len`: number of bytes used to abbreviate object identifiers in
+obj blocks.

Should we write "this can be up to 31" somewhere? It is more than enough for SHA-1 and not quite sufficient for SHA-256 (unless we say "we store obj_id_len-1 here")?
Oh! I'll take a closer look and then follow up.

Thanks for looking it over,
Jonathan
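For reference, the "up to 31" limit follows directly from the `(obj_position << 5) | obj_id_len` packing quoted above: the low five bits hold `obj_id_len`. A minimal sketch in Python, where the helper names `pack_obj_field` and `unpack_obj_field` are invented for this example and only the single footer field is modeled:

```python
OBJ_ID_LEN_BITS = 5
OBJ_ID_LEN_MASK = (1 << OBJ_ID_LEN_BITS) - 1  # 0x1f, so at most 31


def pack_obj_field(obj_position, obj_id_len):
    """Pack obj_position and obj_id_len into one uint64 value, as in
    uint64( (obj_position << 5) | obj_id_len )."""
    if not 0 <= obj_id_len <= OBJ_ID_LEN_MASK:
        raise ValueError("obj_id_len does not fit in 5 bits")
    if not 0 <= obj_position < (1 << (64 - OBJ_ID_LEN_BITS)):
        raise ValueError("obj_position does not fit in 59 bits")
    return (obj_position << OBJ_ID_LEN_BITS) | obj_id_len


def unpack_obj_field(field):
    """Split the packed value back into (obj_position, obj_id_len)."""
    return field >> OBJ_ID_LEN_BITS, field & OBJ_ID_LEN_MASK
```

A full 20-byte SHA-1 fits in the field, while a full 32-byte SHA-256 does not, which is the gap the question above points at.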