Re: [PATCH v2] pack-format.txt: more details on pack file format
From: Stefan Beller <hidden>
Date: 2018-05-10 17:06:42
On Thu, May 10, 2018 at 8:09 AM, Nguyễn Thái Ngọc Duy [off-list ref] wrote:
quoted hunk ↗ jump to hunk
The current document mentions OBJ_* constants without their actual values. A git developer would know these are from cache.h but that's not very friendly to a person who wants to read this file to implement a pack file parser. Similarly, the deltified representation is not documented at all (the "document" is basically patch-delta.c). Translate that C code to English with a bit more about what ofs-delta and ref-delta mean. Signed-off-by: Nguyễn Thái Ngọc Duy <redacted> --- This is a much better description than v1. I hope. Documentation/technical/pack-format.txt | 78 +++++++++++++++++++++++++ cache.h | 5 ++ 2 files changed, 83 insertions(+)diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt index 8e5bf60be3..d20bf592aa 100644 --- a/Documentation/technical/pack-format.txt +++ b/Documentation/technical/pack-format.txt@@ -36,6 +36,84 @@ Git pack format - The trailer records 20-byte SHA-1 checksum of all of the above. +=== Object types + +Valid object types are: + +- OBJ_COMMIT (1) +- OBJ_TREE (2) +- OBJ_BLOB (3) +- OBJ_TAG (4) +- OBJ_OFS_DELTA (6) +- OBJ_REF_DELTA (7) + +Type 5 is reserved for future expansion. Type 0 is invalid. + +=== Deltified representation + +Conceptually there are only four object types: commit, tree, tag and +blob. However to save space, an object could be stored as a "delta" of +another "base" object. These representations are assigned new types +ofs-delta and ref-delta, which is only valid in a pack file.
...only valid... as opposed to loose objects or as opposed to referencing cross-packs? I would think the former, not the latter.
+Both ofs-delta and ref-delta store the "delta" against another +object. The difference between them is, ref-delta directly encodes +20-byte base object name. If the base object is in the same pack, +ofs-delta encodes the offset of the base object in the pack instead.
Reading this paragraph clears up the question from before. The ref delta is a delta to another "reference by hash id (sha1)". What abbreviation is OFS? OFfSet ?
+The delta data is a sequence of instructions to reconstruct an object +from the base object.
As said before the base object is of type 1..4, we do not "delta-on-delta" yet, but to construct the object we have to create the base object first, which itself can be represented as a deltified object leading to a delta chain.
Each instruction appends more and more data to +the target object until it's complete. There are two supported +instructions so far: one for copy a byte range from the source object +and one for inserting new data embedded in the instruction itself.
ok. So there are 2 types of instructions, "copy from (offset, size)" and "new data follows". The next paragraphs seem to describe the copy instruction, maybe add a sub-headline here?
+Each instruction has variable length. Instruction type is determined +by the seventh bit of the first octet. The following diagrams follow +the convention in RFC 1951 (Deflate compressed data format). + + +----------+---------+---------+---------+---------+-------+-------+-------+ + | 1xxxxxxx | offset1 | offset2 | offset3 | offset4 | size1 | size2 | size3 | + +----------+---------+---------+---------+---------+-------+-------+-------+ + +This is the instruction format to copy a byte range from the source +object. It encodes the offset to copy from any the number of bytes to +copy. Offset and size are in little-endian order. + +All offset and size bytes are optional. This is to reduce the +instruction size when encoding small offsets or sizes. The first seven +bits in the first octet determines which of the next seven octets is +present. If bit zero is set, offset1 is present. If bit one is set +offset2 is present and so on. + +Note that a more compact instruction does not change offset and size +encoding. For example, if only offset2 is omitted like below, offset3 +still contains bits 16-23. It does not become offset2 and contains +bits 8-15 even if it's right next to offset1. + + +----------+---------+---------+ + | 10000101 | offset1 | offset3 | + +----------+---------+---------+
It reads very fluently to here.
+In its most compact form, this instruction only takes up one byte +(0x80) with both offset and size omitted, which will have default +values zero. There is another exception: size zero is automatically +converted to 0x10000.
This "another exception" sounds a bit tacked on, but is still understandable. I would imagine that the size of 0 is used frequently to copy large blocks and coincidentally it is represented using the lowest number of bytes for size. Cute! Before the next diagram we could have a sub-headline, indicating that the other instruction "new data follows" will now be described.
+ +----------+============+ + | 0xxxxxxx | data | + +----------+============+ + +This is the instruction to construct target object without the base +object. The following data is appended to the target object. The first +seven bits of the first octet determines the size of data in +bytes. The size must be non-zero.
This command sounds very easy. However we can have at most 127 bytes of new data, so if someone adds a larger part of new code, we'd have many "insert new data" instructions, all at the max of 127, such that the overhead for instruction bytes is 1/127 = 0.7 %. Sounds efficient.
+ +----------+============ + | 00000000 | + +----------+============ + +This is the instruction reserved for future expansion.
Thanks for pointing this out.
+/* + * Values in this enum (except those outside the 3 bit range) are part + * of pack file format. See Documentation/technical/pack-format.txt + * for more information. + */
Makes sense. I really like this patch very much. Thanks for writing it. My annotations are just to add the cherry onto the cake, the current form is Reviewed-by: Stefan Beller <redacted> Thanks!