Thread: 46 messages, 10 authors, 2023-02-06

Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution

From: Ævar Arnfjörð Bjarmason <hidden>
Date: 2023-01-31 11:54:08

On Tue, Jan 31 2023, brian m. carlson wrote:
> Part of the reason I think this is valuable is that once SHA-1 and
> SHA-256 interoperability is present, git archive will change the
> contents of the archive format, since it will embed a SHA-256 hash into
> the file instead of a SHA-1 hash, since that's what's in the repository.
> Thus, we can't produce an archive that's deterministic in the face of
> SHA-1/SHA-256 interoperability concerns, and we need to create a new
> format that doesn't contain that data embedded in it.

I don't see why a format change would be required in this context.

If a repository were to switch over to SHA-256, wouldn't a better
solution be to disambiguate in the URL whether you're requesting a SHA-1
or SHA-256 derived archive? E.g. to never serve up an archive with a
SHA-256 embedded in the header at:

	https://github.com/git/git/archive/refs/tags/v2.39.1.tar.gz

But require a URL like:

	https://github.com/git/git/archive-sha256/refs/tags/v2.39.1.tar.gz

If you did that then existing archives would continue to have the same
byte-for-byte content (assuming that the result of this discussion is
that we support that forever), but they'd always be generated with "-c
extensions.objectFormat=sha1". For always-SHA256 repos such a URL would
fail to generate anything.

For repos that used to be SHA-1 but are now SHA-256, either URL would
work, but the PAX header would differ, referring to the SHA-1 or
SHA-256 commit, respectively.
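
To make the URL idea concrete, here is a minimal sketch of how a server
might pick the object format from the path. Everything in it is an
assumption for illustration: the "archive-sha256" endpoint is only the
name floated above, and requested_object_format() and its repo_formats
argument are hypothetical, not anything GitHub or git provides.

```python
from urllib.parse import urlparse

# Hypothetical mapping from the URL path component to an object format.
# "archive-sha256" is the name floated in this thread, not a real endpoint.
ARCHIVE_ENDPOINTS = {"archive": "sha1", "archive-sha256": "sha256"}


def requested_object_format(url, repo_formats):
    """Return the object format an archive URL asks for.

    repo_formats is the set of formats the repository can serve,
    e.g. {"sha1", "sha256"} for a repo that migrated from SHA-1.
    """
    # Path looks like /<owner>/<repo>/<endpoint>/refs/tags/<tag>.tar.gz
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) < 4 or parts[2] not in ARCHIVE_ENDPOINTS:
        raise ValueError(f"not an archive URL: {url}")
    fmt = ARCHIVE_ENDPOINTS[parts[2]]
    if fmt not in repo_formats:
        # e.g. the plain /archive/ URL on an always-SHA-256 repository
        raise LookupError(f"repository has no {fmt} object names")
    return fmt
```

With this shape the existing /archive/ URLs never change meaning, and
the always-SHA256 case fails loudly instead of silently serving an
archive with a different header.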

Whereas your proposal seems to be that we should omit that SHA-(1|256)
from the "comment" entirely. That would seem to require either a one-off
change of all existing archives, or some cut-off date (or other marker).

If you've got a cut-off, you could also just use it to decide whether to
generate a SHA-1 or SHA-256 archive, and without that you'd be back to
the one-off breakage.

I also find it very useful that we've got the commit OID in the archive,
as it allows for round-tripping from archives back to the relevant
repository commit. Losing that entirely for SHA-1<->SHA-256 interop
would be unfortunate, especially if it turns out we could have easily
kept it.
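
For reference, the round-trip can be sketched in a few lines. git itself
ships "git get-tar-commit-id" for this; the Python below is only an
illustration that reads the same "comment" field from the pax global
header. The helper names embedded_commit_id() and make_demo_archive()
are made up for the example, and the demo archive is a stand-in built
with Python's tarfile module, not real "git archive" output.

```python
import io
import tarfile


def embedded_commit_id(archive_bytes):
    """Return (oid, object_format) from a tar's pax global header, or None."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:") as tar:
        oid = tar.pax_headers.get("comment")
    if oid is None:
        return None
    # 40 hex digits is a SHA-1 object name, 64 is SHA-256.
    fmt = {40: "sha1", 64: "sha256"}.get(len(oid), "unknown")
    return oid, fmt


def make_demo_archive(oid):
    """Build a small stand-in archive with `oid` as the global pax comment."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT,
                      pax_headers={"comment": oid}) as tar:
        data = b"hello\n"
        info = tarfile.TarInfo("demo/README")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()
```

The length of the embedded name is enough to tell a SHA-1 derived
archive from a SHA-256 derived one, which is what keeps the two kinds
distinguishable if both are served.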

> Having said that, I don't think this should be based on the timestamp of
> the file, since that means that two otherwise identical archives
> differing in timestamp aren't ever going to be the same, and we do see
> people who import or vendor other projects.

Yes, I agree that doing this by that sort of heuristic would be bad.

> Nor do I think we should
> attempt to provide consistent compression, since I believe the output of
> things like zlib has changed in the past, and we can't continually carry
> an old, potentially insecure version of zlib just because the output
> changed.  People should be able to implement compression using gzip,
> zlib, pigz, miniz_oxide, or whatever if they want, since people
> implement Git in many different languages, and we won't want to force
> people using memory-safe languages like Go and Rust to explicitly use
> zlib for archives.

As I noted in the side-thread, I think an acceptable solution would be to
push the problem of the consistent compressor downstream, i.e. if a site
like GitHub wants to maintain a potentially old version of GNU gzip,
that should be up to them.

But I think it's a valid concern that we should guarantee the stability
of the archive format.
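
As a small illustration of why the two guarantees can be separated:
if only the uncompressed tar stream is stable, a consumer could
checksum the decompressed bytes instead of the .tar.gz file. This is a
sketch under that assumption, not something proposed verbatim in this
thread; tar_stream_digest() is a hypothetical helper and tar_bytes is a
stand-in for real "git archive --format=tar" output.

```python
import gzip
import hashlib


def tar_stream_digest(gz_bytes):
    """SHA-256 of the decompressed stream, ignoring how it was compressed."""
    return hashlib.sha256(gzip.decompress(gz_bytes)).hexdigest()


# Stand-in for the bytes a stable `git archive --format=tar` would emit.
tar_bytes = b"stand-in tar stream\n" * 512

# Two "compressors": same input, different settings.
fast = gzip.compress(tar_bytes, compresslevel=1)
best = gzip.compress(tar_bytes, compresslevel=9)
```

Both compressed files decompress to the same stream, so
tar_stream_digest(fast) and tar_stream_digest(best) agree whichever
compressor or compression level produced them.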