Re: [PATCH v13 04/13] reftable: file format documentation

From: Jonathan Nieder <hidden>
Date: 2020-05-20 18:52:07

Junio C Hamano wrote:

Han-wen wrote:

quoted

From: Jonathan Nieder <redacted>

Shawn Pearce explains:

Some repositories contain a lot of references (e.g. android at 866k,
rails at 31k). The reftable format provides:

- Near constant time lookup for any single reference, even when the
  repository is cold and not in process or kernel cache.
- Near constant time verification a SHA-1 is referred to by at least
  one reference (for allow-tip-sha1-in-want).

Not quite grammatical sentence?  Perhaps "if" after "verification?

Good catch, thanks.

[...]

quoted

using pandoc 2.2.1.  The result required the following additional
minor changes:

- removed the [TOC] directive to add a table of contents, since
  asciidoc does not support it
- replaced git-scm.com/docs links with linkgit: directives that link
  to other pages within Git's documentation

There are many

	’

funny-quotes where we would prefer to place vanilla single quotes,
which may also need to be corrected in the conversion toolchain.

Looks like Han-Wen is taking care of this (thanks!).

Typoes pointed out below may probably be from the original where
they should be corrected.

I'm happy to do one final update the doc in JGit to match what we end
up with and then replace it with a pointer to Git's copy once that
lands.

[...]

quoted

+Repositories with many loose references occupy a large number of disk
+blocks from the local file system, as each reference is its own file
+storing 41 bytes (and another file for the corresponding reflog). This
+negatively affects the number of inodes available when a large number of
+repositories are stored on the same filesystem. Readers can be penalized
+due to the larger number of syscalls required to traverse and read the
+`$GIT_DIR/refs` directory.

Another downside is that we cannot arrange atomic updates to
multiple refs over loose refs, even though the "lookup of a single
reference does not require linear scan" unlike packed-refs, (as long
as the filesystem does its job).  Worth mentioning?

Yes, this was another major part of the motivation (avoiding the
complication of the "atomic" multi-ref updates to packed-refs that Git
and JGit had to learn).

[...]

quoted

+References stored in a reftable are peeled, a record for an annotated
+(or signed) tag records both the tag object, and the object it refers
+to.

OK.  Peeled results are recorded in packed-refs file because quite
often when we use a tag object, what we actually want to access is
the commit object it points at.  We do so here for the same reason?

Not a rhetorical question, but if it invites a question from a
reader, it may deserve to be described before readers ask it.

For a single tag ref, peeling to a commit is not very expensive.  But
for batch lookups e.g. when serving a response to an ls-remote
request, it adds up, and having the peeled results recorded helps.

[...]

quoted

+Directory/file conflicts
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+The reftable format accepts both `refs/heads/foo` and
+`refs/heads/foo/bar` as distinct references.
+
+This property is useful for retaining log records in reftable, but may
+confuse versions of Git using `$GIT_DIR/refs` directory tree to maintain
+references. Users of reftable may choose to continue to reject `foo` and
+`foo/bar` type conflicts to prevent problems for peers.

[...]

"users ... may choose" implies that it is up to the implementation
of reftable user which one to show, so given a single repository,
"jgit" may show "refs/heads/foo" while "libgit2" may choose to show
the other one.

I am not sure if that is desirable---I suspect that we want to
record which one needs to be chosen so that these "D/F conflicts
disallowing" users can make consistent choices, but I dunno.

Yes, I think it would be better to explicitly say that Git will continue
to reject D/F conflicts for refs (*not* reflogs) even though the format
can support them in principle.

If we choose to permit them some day in the future, I believe that would
be a separate repository format extension and protocol capability to
avoid confusing old versions of Git.

[...]

quoted

+Symbolic references use `0x3`, followed by the complete name of the
+reference target. No compression is applied to the target name.

Is there a place in the file format where an incomplete name can be
stored?  If not, I think it makes it easier to read if we drop
"complete" from the sentence.

The sentence about "no compression" covers the lack of prefix encoding,
so I suppose I agree.

Might make sense to say "full name" to convey that we're talking about
rev-parse --symbolic-full-name, not a relative path like symlinks
support.

[...]

quoted

+Log block format
+^^^^^^^^^^^^^^^^
+
+Unlike ref and obj blocks, log blocks are always unaligned.
+
+Log blocks are variable in size, and do not match the `block_size`
+specified in the file header or footer. Writers should choose an
+appropriate buffer size to prepare a log block for deflation, such as
+`2 * block_size`.

I can guess the reason behind this design decision, but the readers
may not be able to.  Should we write it down here, or would it make
too much irrelevant details?

I don't have a strong opinion.  It sounds like Han-Wen sees something to
explain there, so I suppose it would be nice to spell out.

(My take: reflog lookups are not on the critical path for most
operations; especially, random accesses do not need to be fast.  From a
performance perspective, the best we can do is to compress them well to
decrease I/O cost, hence there's not much value to alignment.)

[...]

This is a tangent but in a repository at hosting provider, whose
primary (and often the only) source of updates are by end-user
pushing into it, if reflogs are enabled, whose name and email are
recorded in the logs?  The committer or tagger of the object that
sits at the tip of the ref after the update?  What happens when a
blob is pushed to update a ref?  Or would it be just a single "user"
that represents the "server operator"?

The latter, "server operator" (GIT_COMMITTER_IDENT at the server).

Committer in commit objects is forgeable, hence wouldn't be very
useful here.

We know in a non-bare repository an individual contributor works on
typically records only one <name, email> in the reflog: the user who
works in it.

What I am trying to get at is if it makes more sense to have a small
table of unique <name, email> pairs used in the file and have
log_data record a single varint that is the index into that
"committer ident" table.  I would suspect that it would give us
significantly more gain than mere <> two bytes per log_data entry.

That's true, and a good idea for the next rev of the format.

[...]

quoted

+A 68-byte footer appears at the end:
+
+....
+    'REFT'
+    uint8( version_number = 1 )
+    uint24( block_size )
+    uint64( min_update_index )
+    uint64( max_update_index )
+
+    uint64( ref_index_position )
+    uint64( (obj_position << 5) | obj_id_len )

[...]

quoted

+* `obj_id_len`: number of bytes used to abbreviate object identifiers in
+obj blocks.

Should we write "this can be up to 31" somewhere?  It is more than
enough for SHA-1 and not quite sufficient for SHA-256 (unless we say
"we store obj_id_len-1 here")?

Oh!  I'll take a closer look and then follow up.

Thanks for looking it over,
Jonathan

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help