Thread: 47 messages, 7 authors, 2022-07-01

Re: [PATCH 2/2] archive: avoid spawning `gzip`

From: Jeff King <hidden>
Date: 2019-06-13 19:16:33

On Mon, Jun 10, 2019 at 12:44:54PM +0200, René Scharfe wrote:
On 01.05.19 at 20:18, Jeff King wrote:
On Wed, May 01, 2019 at 07:45:05PM +0200, René Scharfe wrote:
But since the performance is still not quite on par with `gzip`, I would
actually rather not, and really, just punt on that one, stating that
people interested in higher performance should use `pigz`.
Here are my performance numbers for generating .tar.gz files again:
OK, tried one more version, with pthreads (patch at the end).  Also
redid all measurements for better comparability; everything is faster
now for some reason (perhaps due to a compiler update? clang version
7.0.1-8 now):
Hmm. Interesting that using pthreads is still slower than just shelling
out to gzip:
master, using gzip(1):
Benchmark #1: git archive --format=tgz HEAD
  Time (mean ± σ):     15.697 s ±  0.246 s    [User: 19.213 s, System: 0.386 s]
  Range (min … max):   15.405 s … 16.103 s    10 runs
[...]
using zlib in a separate thread (that's the new one):
Benchmark #1: git archive --format=tgz HEAD
  Time (mean ± σ):     16.310 s ±  0.237 s    [User: 20.075 s, System: 0.173 s]
  Range (min … max):   15.983 s … 16.790 s    10 runs
I wonder if zlib is just slower. Or if the cost of context switching
is somehow higher than just dumping big chunks over a pipe. In
particular, our gzip-alike is still faster than pthreads:
using a gzip-lookalike:
Benchmark #1: git archive --format=tgz HEAD
  Time (mean ± σ):     16.289 s ±  0.218 s    [User: 19.485 s, System: 0.337 s]
  Range (min … max):   16.020 s … 16.555 s    10 runs
though it looks like the timings do overlap.
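
The overlap remark can be checked with a little arithmetic on the figures quoted above. This is a rough sketch only: it compares one-sigma bands (mean ± σ) of the wall-clock times, which is a crude sanity check and not a significance test. The dictionary keys and the `overlaps` helper are names made up for this illustration.

```python
# (mean, sigma) wall-clock seconds, copied from the benchmark output quoted above.
timings = {
    "gzip(1)": (15.697, 0.246),
    "zlib in a separate thread": (16.310, 0.237),
    "gzip-lookalike": (16.289, 0.218),
}

def band(mean_sigma):
    """Return the closed interval [mean - sigma, mean + sigma]."""
    mean, sigma = mean_sigma
    return (mean - sigma, mean + sigma)

def overlaps(a, b):
    """True if the closed intervals a and b share at least one point."""
    return max(a[0], b[0]) <= min(a[1], b[1])

bands = {name: band(ms) for name, ms in timings.items()}

for first, second in [
    ("zlib in a separate thread", "gzip-lookalike"),
    ("gzip(1)", "zlib in a separate thread"),
]:
    verdict = "overlap" if overlaps(bands[first], bands[second]) else "do not overlap"
    print(f"{first} vs {second}: one-sigma bands {verdict}")
```

Under this check the pthreads and gzip-lookalike bands overlap, while the gzip(1) band sits below both.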
At GitHub we certainly do cache the git-archive output. We'd also be
just fine with the sequential solution. We generally turn down
pack.threads to 1, and keep our CPUs busy by serving multiple users
anyway.

So whatever has the lowest overall CPU time is generally preferable, but
the times are close enough that I don't think we'd care much either way
(and it's probably not worth having a config option or similar).
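
To put a number on "lowest overall CPU time", the user and system figures from the three runs quoted above can simply be added up. This is a sketch covering only the variants whose output is shown in this message (others are elided by the "[...]"), and the variable names are invented for the illustration.

```python
# (user, system) CPU seconds, copied from the three benchmark runs quoted above.
cpu_seconds = {
    "gzip(1)": (19.213, 0.386),
    "zlib in a separate thread": (20.075, 0.173),
    "gzip-lookalike": (19.485, 0.337),
}

def total_cpu(user_sys):
    """Total CPU time is user time plus system time."""
    user, system = user_sys
    return user + system

totals = {name: total_cpu(t) for name, t in cpu_seconds.items()}
baseline = totals["gzip(1)"]

# List the variants from cheapest to most expensive, relative to gzip(1).
for name, total in sorted(totals.items(), key=lambda kv: kv[1]):
    extra = 100.0 * (total / baseline - 1.0)
    print(f"{name:28s} {total:7.3f} s CPU ({extra:+.1f}% vs gzip(1))")
```

Among these three, the spread in total CPU time is only a few percent, which matches the "close enough" assessment.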
Moving back to 2009 and reducing the number of utilized cores both feel
weird, but the sequential solution *is* the most obvious, easiest and
(by a narrow margin) lightest one if gzip(1) is not an option anymore.
It sounds like we resolved to give the "internal gzip" its own name
(whether it's a gzip-alike command, or a special name we recognize to
trigger the internal code). So maybe we could continue to default to
"gzip -cn", but platforms could do otherwise when shipping gzip there is
a pain (i.e. Windows, but maybe also anybody else who wants to set
NO_EXTERNAL_GZIP or detect it from autoconf).
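
As a rough illustration of that default-with-fallback idea (in Python rather than git's C, and not the actual patch): keep piping through "gzip -cn" where the binary exists, and compress in-process otherwise. The function names and the `no_external_gzip` flag are hypothetical stand-ins for whatever a build knob like NO_EXTERNAL_GZIP would select. Passing `mtime=0` mirrors the effect of `-n`, which leaves the timestamp out of the gzip header.

```python
import gzip
import shutil
import subprocess

def gzip_external(data: bytes) -> bytes:
    """Pipe data through `gzip -cn` (stdout output, no name or timestamp)."""
    result = subprocess.run(
        ["gzip", "-cn"], input=data, stdout=subprocess.PIPE, check=True
    )
    return result.stdout

def gzip_internal(data: bytes) -> bytes:
    """Compress in-process via zlib; mtime=0 keeps the output reproducible."""
    return gzip.compress(data, mtime=0)

def pick_compressor(no_external_gzip: bool):
    """Hypothetical selection: use the internal path when asked or when gzip is absent."""
    if no_external_gzip or shutil.which("gzip") is None:
        return gzip_internal
    return gzip_external

compress = pick_compressor(no_external_gzip=True)
payload = b"example tar stream " * 100
archive = compress(payload)
print(len(payload), "->", len(archive), "bytes")
```

Either path yields a gzip stream that `gzip -d` or `gzip.decompress` can read, so callers need not know which one was chosen.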

-Peff