Re: [PATCH 2/2] archive: avoid spawning `gzip`
From: Jeff King <hidden>
Date: 2019-06-13 19:16:33
On Mon, Jun 10, 2019 at 12:44:54PM +0200, René Scharfe wrote:
On 01.05.19 at 20:18, Jeff King wrote:
On Wed, May 01, 2019 at 07:45:05PM +0200, René Scharfe wrote:
But since the performance is still not quite on par with `gzip`, I would actually rather not, and really, just punt on that one, stating that people interested in higher performance should use `pigz`.
Here are my performance numbers for generating .tar.gz files again:
OK, tried one more version, with pthreads (patch at the end). Also redid all measurements for better comparability; everything is faster now for some reason (perhaps due to a compiler update? clang version 7.0.1-8 now):
Hmm. Interesting that using pthreads is still slower than just shelling out to gzip:
master, using gzip(1):
  Benchmark #1: git archive --format=tgz HEAD
    Time (mean ± σ):     15.697 s ±  0.246 s    [User: 19.213 s, System: 0.386 s]
    Range (min … max):   15.405 s … 16.103 s    10 runs
[...]
using zlib in a separate thread (that's the new one):
  Benchmark #1: git archive --format=tgz HEAD
    Time (mean ± σ):     16.310 s ±  0.237 s    [User: 20.075 s, System: 0.173 s]
    Range (min … max):   15.983 s … 16.790 s    10 runs
I wonder if zlib is just slower. Or if the cost of context switching is somehow higher than just dumping big chunks over a pipe. In particular, our gzip-alike is still faster than pthreads:
using a gzip-lookalike:
  Benchmark #1: git archive --format=tgz HEAD
    Time (mean ± σ):     16.289 s ±  0.218 s    [User: 19.485 s, System: 0.337 s]
    Range (min … max):   16.020 s … 16.555 s    10 runs
though it looks like the timings do overlap.
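To make the overlap concrete, here is a rough sketch in Python that checks whether the min/max ranges of two runs intersect and how far apart the means are relative to the spread. The numbers are copied from the benchmark output quoted above; the helper names `ranges_overlap` and `mean_gap_in_sigmas` are made up for illustration, and this is a sanity check rather than a proper significance test.

```python
# Timings in seconds, copied from the benchmark output quoted in this mail.
runs = {
    "gzip(1)":        {"mean": 15.697, "sigma": 0.246, "min": 15.405, "max": 16.103},
    "zlib thread":    {"mean": 16.310, "sigma": 0.237, "min": 15.983, "max": 16.790},
    "gzip-lookalike": {"mean": 16.289, "sigma": 0.218, "min": 16.020, "max": 16.555},
}


def ranges_overlap(a, b):
    """Return True if the [min, max] ranges of two runs intersect."""
    return a["min"] <= b["max"] and b["min"] <= a["max"]


def mean_gap_in_sigmas(a, b):
    """Difference of the means, in units of the larger standard deviation."""
    return abs(a["mean"] - b["mean"]) / max(a["sigma"], b["sigma"])


thread, lookalike = runs["zlib thread"], runs["gzip-lookalike"]
print(ranges_overlap(thread, lookalike))                # ranges intersect?
print(round(mean_gap_in_sigmas(thread, lookalike), 2))  # gap in sigmas
```

The means of the pthreads version and the gzip-lookalike differ by well under one standard deviation, so these ten runs cannot really separate the two.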
At GitHub we certainly do cache the git-archive output. We'd also be just fine with the sequential solution. We generally turn down pack.threads to 1, and keep our CPUs busy by serving multiple users anyway. So whatever has the lowest overall CPU time is generally preferable, but the times are close enough that I don't think we'd care much either way (and it's probably not worth having a config option or similar).
Moving back to 2009 and reducing the number of utilized cores both feels weird, but the sequential solution *is* the most obvious, easiest and (by a narrow margin) lightest one if gzip(1) is not an option anymore.
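For the "lowest overall CPU time" comparison, a small sketch that adds up the User and System figures from the three benchmarks quoted in this mail. The numbers for the sequential variant are elided above ("[...]"), so it is deliberately left out here; the dictionary layout is only an illustration.

```python
# (user, system) CPU seconds, copied from the benchmark output quoted in this mail.
cpu = {
    "gzip(1)":        (19.213, 0.386),
    "zlib thread":    (20.075, 0.173),
    "gzip-lookalike": (19.485, 0.337),
}

# Overall CPU time is user + system, independent of wall-clock time.
totals = {name: user + system for name, (user, system) in cpu.items()}

# Print the variants from cheapest to most expensive in total CPU time.
for name, total in sorted(totals.items(), key=lambda kv: kv[1]):
    print(f"{name:15s} {total:.3f} s")
```

Of these three, the external gzip(1) has the lowest total, and the spread between the cheapest and the most expensive variant is well under one CPU second per archive.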
It sounds like we resolved to give the "internal gzip" its own name (whether it's a gzip-alike command, or a special name we recognize to trigger the internal code). So maybe we could continue to default to "gzip -cn", but platforms could do otherwise when shipping gzip there is a pain (i.e. Windows, but maybe also anybody else who wants to set NO_EXTERNAL_GZIP or detect it from autoconf).
-Peff