Thread: 46 messages, 10 authors, 2023-02-06

Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution

From: brian m. carlson <hidden>
Date: 2023-01-31 09:55:04

On 2023-01-31 at 00:06:44, Eli Schwartz wrote:
> Nevertheless, I've seen the sentiment a few times that git doesn't like
> committing to output stability of git-archive, because it isn't
> officially documented (but it's not entirely clear what the benefits of
> changing are). And yet, git endeavors to do so, in order to prevent
> unnecessary breakage of people who embody Hyrum's Law and need that
> stability.

I'm one of the GitHub employees who chimed in there, and I'm also a Git
contributor in my own time (and I am speaking here only in my personal
capacity, since this is a personal address).  I made a change some years
back to the archive format to fix the permissions on pax headers when
extracted as files, and kernel.org was relying on the previous output
and broke.  Linus yelled at me because of that.

Since then, I've been very opposed to us implicitly guaranteeing output
format consistency; if we guarantee it, we should do so explicitly.  I
previously sent some patches documenting this explicitly, but I don't
think they were ever picked up.  I very much don't want people to come
to rely on our behaviour unless we explicitly guarantee it.

> What does everyone think about offering versioned git-archive outputs?
> This could be user-selectable as an option to `git archive`, but the
> main goal would be to select a good versioned output format depending on
> what is being archived. So:
>
> - first things first, un-default the internal compressor again
> - implement a v2 archive format, where the internal compressor is the
>   default -- no other changes
> - teach git to select an archive format based on the date of the object
>   being archived
>   - when given a commit/tag ID to archive, check which support frame the
>     committer date falls inside
>   - for tree IDs, always use the latest format (it always uses the
>     current date anyway)
> - schedule a date, for the sake of argument, 6 months after the next
>   scheduled release date of git version X.Y in which this change goes
>   live; bake this into the git sources as a transition date, all commits
>   or tags generated after this date fall into the next format support
>   frame

I am actually very much in favour of providing a standard, deterministic
version of pax (the extended tar format) that we use and documenting it
as a standard so that other archive tools can use that.  That is, we
document some canonical tar format that is bit-for-bit identical that we
(and hopefully GNU tar and libarchive) will agree should be used to
serialize files for software interchange.  I don't think this should be
dependent on the date at all, but I do believe it should be versioned
and tested, and the version number embedded as a pax header.  I think
this would be valuable for simply having reproducible archives in
general, including for things like Docker containers, Debian packages,
Rust crates, and more, and I'm happy to work with others on such a
format, as I've said in the past on the list.  People can opt-in to
whatever format they want when creating an archive and continue to use
that forever if they like.
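
To make the idea of a versioned, bit-for-bit reproducible pax archive a
bit more concrete, here is a minimal sketch using Python's standard
tarfile module.  It is not a proposal for the actual format: the header
key EXAMPLE.archive-format-version, the version value "1", and the
normalization choices (sorted names, zero mtime, fixed mode, empty owner
names) are all assumptions made purely for illustration.

```python
import io
import tarfile

# Hypothetical header key and version; not defined by Git or any standard.
FORMAT_VERSION_KEY = "EXAMPLE.archive-format-version"
FORMAT_VERSION = "1"


def build_archive(files: dict[str, bytes]) -> bytes:
    """Serialize files to an uncompressed pax tar with normalized metadata."""
    buf = io.BytesIO()
    with tarfile.open(
        fileobj=buf,
        mode="w",
        format=tarfile.PAX_FORMAT,
        # Written once as a pax global extended header.
        pax_headers={FORMAT_VERSION_KEY: FORMAT_VERSION},
    ) as tf:
        # Sort names so the input ordering cannot affect the output.
        for name in sorted(files):
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            # Pin every field that would otherwise vary between machines.
            info.mtime = 0
            info.mode = 0o644
            info.uid = info.gid = 0
            info.uname = info.gname = ""
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


if __name__ == "__main__":
    a = build_archive({"b.txt": b"two", "a.txt": b"one"})
    b = build_archive({"a.txt": b"one", "b.txt": b"two"})
    print("identical:", a == b)
```

A consumer could read the version back from the global header and
refuse archives written in a format version it does not know.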

Part of the reason I think this is valuable is that once SHA-1 and
SHA-256 interoperability is present, git archive will change the
contents of the archive: it will embed a SHA-256 hash into the file
instead of a SHA-1 hash, because that's what's in the repository.
Thus, we can't produce an archive that's deterministic in the face of
SHA-1/SHA-256 interoperability concerns, and we need to create a new
format that doesn't contain that data embedded in it.
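
As a rough illustration of why the embedded ID matters, the sketch below
builds two tar files with identical file contents that differ only in a
pax global "comment" header, which is where git archive records the
commit ID.  The IDs are placeholder strings of the right length, not
real object names, and make_tar and embedded_comment are hypothetical
helper names; the archives are built with Python's tarfile module rather
than with git itself.

```python
import io
import tarfile


def make_tar(commit_id: str) -> bytes:
    """Build a one-file pax tar carrying commit_id as a global comment."""
    buf = io.BytesIO()
    with tarfile.open(
        fileobj=buf,
        mode="w",
        format=tarfile.PAX_FORMAT,
        pax_headers={"comment": commit_id},
    ) as tf:
        data = b"hello\n"
        info = tarfile.TarInfo("README")
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def embedded_comment(tar_bytes: bytes):
    """Return the global pax 'comment' value, or None if absent."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tf:
        tf.getmembers()  # make sure all headers have been read
        return tf.pax_headers.get("comment")


if __name__ == "__main__":
    sha1_style = make_tar("a" * 40)    # placeholder 40-hex-digit ID
    sha256_style = make_tar("b" * 64)  # placeholder 64-hex-digit ID
    print(embedded_comment(sha1_style))
    print("same bytes:", sha1_style == sha256_style)
```

The file data is the same in both archives, yet the byte streams differ,
so any checksum taken over the whole archive differs too.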

Having said that, I don't think this should be based on the timestamp of
the file, since that means that two otherwise identical archives
differing in timestamp aren't ever going to be the same, and we do see
people who import or vendor other projects.  Nor do I think we should
attempt to provide consistent compression, since I believe the output of
things like zlib has changed in the past, and we can't continually carry
an old, potentially insecure version of zlib just because the output
changed.  People should be able to implement compression using gzip,
zlib, pigz, miniz_oxide, or whatever if they want, since people
implement Git in many different languages, and we won't want to force
people using memory-safe languages like Go and Rust to explicitly use
zlib for archives.

That may mean that it's important for people to actually decompress the
archive before checking hashes if they want deterministic behaviour, and
I'm okay with that.  You already have to do that if you're verifying the
signature on Git tarballs, since only the uncompressed tar archive is
signed, so I don't think this is out of the question.
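
For illustration, checking the hash of the decompressed stream rather
than the compressed file might look like the following sketch.  The
function name tar_digest and the choice of SHA-256 are assumptions for
the example; it uses only Python's standard gzip and hashlib modules.

```python
import gzip
import hashlib
import io


def tar_digest(compressed: bytes) -> str:
    """Return the SHA-256 of the decompressed stream inside a gzip file."""
    h = hashlib.sha256()
    with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as gz:
        # Read in chunks so large archives need not fit in memory.
        for chunk in iter(lambda: gz.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


if __name__ == "__main__":
    # Stand-in for an uncompressed tar stream.
    payload = b"example tar stream contents\n" * 4096
    fast = gzip.compress(payload, compresslevel=1, mtime=0)
    best = gzip.compress(payload, compresslevel=9, mtime=0)
    print("compressed bytes equal:", fast == best)
    print("decompressed digests equal:", tar_digest(fast) == tar_digest(best))
```

The compressed bytes will typically differ between compression levels
and implementations, while the digest of the decompressed stream stays
the same.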
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA
