Thread (47 messages) 47 messages, 7 authors, 2022-07-01

Re: [PATCH 2/2] archive: avoid spawning `gzip`

From: Johannes Schindelin <hidden>
Date: 2019-05-03 20:49:44

Hi Ævar,

On Thu, 2 May 2019, Ævar Arnfjörð Bjarmason wrote:
On Fri, Apr 26 2019, Johannes Schindelin wrote:
quoted
On Sat, 13 Apr 2019, brian m. carlson wrote:
quoted
On Fri, Apr 12, 2019 at 09:51:02PM -0400, Jeff King wrote:
quoted
I wondered how you were going to kick this in, since users can
define arbitrary filters. I think it's kind of neat to
automagically convert "gzip -cn" (which also happens to be the
default). But I think we should mention that in the Documentation,
in case somebody tries to use a custom version of gzip and wonders
why it isn't kicking in.

Likewise, it might make sense in the tests to put a poison gzip in
the $PATH so that we can be sure we're using our internal code, and
not just calling out to gzip (on platforms that have it, of
course).

The alternative is that we could use a special token like ":zlib"
or something to indicate that the internal implementation should be
used (and then tweak the baked-in default, too). That might be less
surprising for users, but most people would still get the benefit
since they'd be using the default config.
I agree that a special value (or NULL, if that's possible) would be
nicer here. That way, if someone does specify a custom gzip, we honor
it, and it serves to document the code better. For example, if
someone symlinked pigz to gzip and used "gzip -cn", then they might
not get the parallelization benefits they expected.
I went with `:zlib`. The `NULL` value would not really work, as there
is no way to specify that via `archive.tgz.command`.

About the symlinked thing: I do not really want to care to support
such hacks.
It's the standard way by which a lot of systems do this, e.g. on my
Debian box:

    $ find /{,s}bin /usr/{,s}bin -type l -exec file {} \;|grep /etc/alternatives|wc -l
    108

To write this E-Mail I'm invoking one such symlink :)
I am well aware of the way Debian-based systems handle alternatives, and I
myself also use something similar to write this E-Mail (but it is not a
symlink, it is a Git alias).

But that's not the hack that I was talking about.

The hack I meant was: if you symlink `gzip` to `pigz` in your `PATH` *and
then expect `git archive --format=tgz` to pick that up*.

As far as I am concerned, the fact that `git archive --format=tgz` spawns
`gzip` to perform the compression is an implementation detail, and not
something that users should feel they can rely on.
quoted
If you want a different compressor than the default (which can
change), you should specify it specifically.
You might want to do so system-wide, or for each program at a time.

I don't care about this for gzip myself, just pointing out it *is* a
thing people use.
Sure.
quoted
quoted
I'm fine overall with the idea of bringing the compression into the
binary using zlib, provided that we preserve the "-n" behavior
(producing reproducible archives).
Thanks for voicing this concern. I had a look at zlib's source code,
and it looks like it requires an extra function call (that we don't
call) to make the resulting file non-reproducible. In other words, it
has the opposite default behavior from `gzip`.
Just commenting on the overall thread: I like René's "new built-in"
patch best.
I guess we now have to diverging votes: yours for the `git archive --gzip`
"built-in" and Peff's for the async code ;-)
You mentioned "new command that we have to support for eternity". I
think calling it "git gzip" is a bad idea. We'd make it "git
archive--gzip" or "git archive--helper", and we could hide building it
behind some compat flag.

Then we'd carry no if/else internal/external code, and the portability
issue that started this would be addressed, no?
Sure.

The async version would leave the door wide open for implementing pigz'
trick to multi-thread the compression, though.
As a bonus we could also drop the "GZIP" prereq from the test suite
entirely and just put that "gzip" in $PATH for the purposes of the
tests.

I spied on your yet-to-be-submitted patches and you could drop GZIP from
the "git archive" tests, but we'd still need it in
t/t5562-http-backend-content-length.sh, but not if we had a "gzip"
compat helper.
We need it at least once for *decompressing* the `--format=tgz` output in
order to compare it to the `--format=tar` output. Besides, I think it is
really important to keep the test that verifies that the output is correct
(i.e. that gzip can decompress it).
There's also a long-standing bug/misfeature in git-archive that I wonder
about: When you combine --format with --remote you can only generate
e.g. tar.gz if the remote is OK with it, if it says no you can't even if
it supports "tar" and you could do the "gz" part locally. Would such a
patch be harder with :zlib than if we always just spewed out to external
"gzip" after satisfying some criteria?
I think it would be precisely the same: you'd still use the same "filter"
code path.

Ciao,
Dscho
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help