Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
From: Ævar Arnfjörð Bjarmason <hidden>
Date: 2023-01-31 11:54:08
On Tue, Jan 31 2023, brian m. carlson wrote:
Part of the reason I think this is valuable is that once SHA-1 and SHA-256 interoperability is present, git archive will change the contents of the archive format, since it will embed a SHA-256 hash into the file instead of a SHA-1 hash, since that's what's in the repository. Thus, we can't produce an archive that's deterministic in the face of SHA-1/SHA-256 interoperability concerns, and we need to create a new format that doesn't contain that data embedded in it.
I don't see why a format change would be required in this context. If a repository were to switch over to SHA-256, wouldn't a better solution be to disambiguate in the URL whether you're requesting a SHA-1 or SHA-256 derived archive? E.g. to never serve up an archive with a SHA-256 embedded in the header at:
https://github.com/git/git/archive/refs/tags/v2.39.1.tar.gz
But require a URL like:
https://github.com/git/git/archive-sha256/refs/tags/v2.39.1.tar.gz
If you did that then existing archives would continue to have the same byte-for-byte content (assuming that the result of this discussion is that we support that forever), and they'd always be generated with "-c extensions.objectFormat=sha1". For always-SHA256 repos such a URL would fail to generate anything. For repos that used to be SHA-1 but are now SHA-256, either URL would work, but the PAX header would differ, referring to the SHA-1 or SHA-256 commit, respectively.
Whereas your proposal seems to be that we should omit that SHA-(1|256) from the "comment" entirely. That would seem to require either a one-off change of all existing archives, or some cut-off date (or other marker). If you've got a cut-off, you could also just use it to decide whether to generate a SHA-1 or SHA-256 archive, and without that you'd be back to the one-off breakage.
I also find it very useful that we've got the commit OID in the archive, as it allows for round-tripping from archives back to the relevant repository commit. Losing that entirely for SHA-1<->SHA-256 interop would be unfortunate, especially if it turns out we could have easily kept it.
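To illustrate the round-tripping point: when given a commit or tag, git archive records the commit OID as a "comment" in the pax global header, which is what "git get-tar-commit-id" reads back. A minimal Python sketch, assuming an uncompressed or gzip-compressed tarball; the helper names are made up for illustration, and the 40 vs. 64 hex digit check is simply the length of a SHA-1 vs. SHA-256 object name:

```python
import io
import tarfile


def archive_commit_id(fileobj):
    """Return the "comment" from the pax global header, or None if absent."""
    # tarfile parses a leading global header while opening the archive and
    # exposes its key/value pairs as TarFile.pax_headers.
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        return tar.pax_headers.get("comment")


def oid_algorithm(oid):
    """Guess the hash algorithm from the hex length of an object name."""
    return {40: "sha1", 64: "sha256"}.get(len(oid), "unknown")


def make_demo_archive(oid):
    """Build a tiny in-memory tar with a pax global "comment" (demo only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT,
                      pax_headers={"comment": oid}) as tar:
        data = b"hello\n"
        info = tarfile.TarInfo("README")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf
```

With something like this, a consumer could tell from the archive alone which kind of OID it was generated against, which is the property that would be lost if the "comment" were dropped.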
Having said that, I don't think this should be based on the timestamp of the file, since that means that two otherwise identical archives differing in timestamp aren't ever going to be the same, and we do see people who import or vendor other projects.
Yes, I agree that doing this by that sort of heuristic would be bad.
Nor do I think we should attempt to provide consistent compression, since I believe the output of things like zlib has changed in the past, and we can't continually carry an old, potentially insecure version of zlib just because the output changed. People should be able to implement compression using gzip, zlib, pigz, miniz_oxide, or whatever if they want, since people implement Git in many different languages, and we won't want to force people using memory-safe languages like Go and Rust to explicitly use zlib for archives.
As I noted in the side-thread, I think an acceptable solution would be to push the problem of the consistent compressor downstream, i.e. if a site like GitHub wants to maintain a potentially old version of GNU gzip, that should be up to them. But I think it's a valid concern that we should guarantee the stability of the archive format.
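One way to see why the compressor can be treated as a separate, downstream concern: the compressed bytes depend on the implementation and its settings, but the tar stream underneath does not. A minimal Python sketch, assuming a consumer is free to checksum the decompressed stream rather than the .tar.gz itself; the function name and the stand-in data are invented for illustration:

```python
import gzip
import hashlib


def tar_digest(gz_bytes):
    """SHA-256 of the decompressed stream, independent of the compressor."""
    return hashlib.sha256(gzip.decompress(gz_bytes)).hexdigest()


tar_bytes = b"stand-in for a tar stream\n" * 1000
fast = gzip.compress(tar_bytes, compresslevel=1)
best = gzip.compress(tar_bytes, compresslevel=9)

# The compressed outputs need not match byte-for-byte, but the
# digest of what they decompress to always does.
assert tar_digest(fast) == tar_digest(best)
```

This only covers the case where the checksum is taken over the tar; anyone who has pinned a hash of the compressed file still depends on whoever runs the compressor keeping its output stable.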