Re: [PATCH v4 4/6] archive-tar: add internal gzip implementation
From: René Scharfe <hidden>
Date: 2022-06-16 18:55:57
On 15.06.22 at 22:32, Ævar Arnfjörð Bjarmason wrote:
On Wed, Jun 15 2022, René Scharfe wrote:
Git uses zlib for its own object store, but calls gzip when creating tgz archives. Add an option to perform the gzip compression for the latter using zlib, without depending on the external gzip binary.

Plug it in by making write_block a function pointer and switching to a compressing variant if the filter command has the magic value "git archive gzip".

Does that indirection slow down tar creation? Not really, at least not in this test:

$ hyperfine -w3 -L rev HEAD,origin/main -p 'git checkout {rev} && make' \
'./git -C ../linux archive --format=tar HEAD # {rev}'

Shameless plug: https://lore.kernel.org/git/211201.86r1aw9gbd.gmgdl@evledraar.gmail.com/

I.e. a "hyperfine" wrapper I wrote to make exactly this sort of thing easier. You'll find that you need less or no --warmup with it, since the checkout flip-flopping and re-making (and resulting FS and other cache eviction) will go away, as we'll use different "git worktree"'s for the two "rev".
OK, but requiring hyperfine alone is burden enough for reviewers. I had a try anyway and it took me a while to realize that git-hyperfine requires setting the Git config option hyperfine.run-dir and that it ignores it on my system. Had to hard-code it in the script.
(Also, putting those on a ramdisk really helps)
Benchmark #1: ./git -C ../linux archive --format=tar HEAD # HEAD
  Time (mean ± σ): 4.044 s ± 0.007 s [User: 3.901 s, System: 0.137 s]
  Range (min … max): 4.038 s … 4.059 s 10 runs

Benchmark #2: ./git -C ../linux archive --format=tar HEAD # origin/main
  Time (mean ± σ): 4.047 s ± 0.009 s [User: 3.903 s, System: 0.138 s]
  Range (min … max): 4.038 s … 4.066 s 10 runs

How does tgz creation perform?

$ hyperfine -w3 -L command 'gzip -cn','git archive gzip' \
'./git -c tar.tgz.command="{command}" -C ../linux archive --format=tgz HEAD'

Benchmark #1: ./git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD
  Time (mean ± σ): 20.404 s ± 0.006 s [User: 23.943 s, System: 0.401 s]
  Range (min … max): 20.395 s … 20.414 s 10 runs

Benchmark #2: ./git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD
  Time (mean ± σ): 23.807 s ± 0.023 s [User: 23.655 s, System: 0.145 s]
  Range (min … max): 23.782 s … 23.857 s 10 runs

Summary
  './git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD' ran
  1.17 ± 0.00 times faster than './git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD'

So the internal implementation takes 17% longer on the Linux repo, but uses 2% less CPU time. That's because the external gzip can run in parallel on its own processor, while the internal one works sequentially and avoids the inter-process communication overhead.

What are the benefits? Only an internal sequential implementation can offer this eco mode, and it allows avoiding the gzip(1) requirement.

I had been keeping one eye on this series, but didn't look at it in any detail. I found this after reading 6/6, which I think in any case could really use some "why" summary, which seems to mostly be covered here. I.e. it's unclear if the "drop the dependency on gzip(1)" in 6/6 is a reference to the GZIP test dependency, or that our users are unlikely to have "gzip(1)" on their systems.
It's to avoid a run dependency; the build/test dependency remains.
If it's the latter I'd much rather (as a user) take a 17% wallclock improvement over a 2% cost of CPU. I mostly care about my own time, not that of the CPU.
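For reference, the 17% and 2% figures can be rederived from the hyperfine means quoted above. This is only a sketch of the arithmetic; the function name overhead is made up for illustration, and CPU time is taken to be user plus system time:

```shell
#!/bin/sh
# Recompute the percentages from the hyperfine means quoted in the thread.
# "overhead" is a hypothetical helper name, not part of git or hyperfine.
overhead() {
    awk 'BEGIN {
        ext_wall = 20.404; ext_cpu = 23.943 + 0.401   # gzip -cn: wall, user + system
        int_wall = 23.807; int_cpu = 23.655 + 0.145   # git archive gzip: wall, user + system
        printf "wall-clock: +%.1f%%\n", (int_wall / ext_wall - 1) * 100
        printf "CPU time: -%.1f%%\n", (1 - int_cpu / ext_cpu) * 100
    }'
}
overhead
```

This prints a wall-clock increase of 16.7% and a CPU time reduction of 2.2% for the internal implementation, matching the rounded numbers in the commit message.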
Understandable, and you can set tar.tgz.command='gzip -cn' to get the old behavior. Saving energy is a better default, though. The runtime in the real world probably includes lots more I/O time. The tests above are repeated and warmed up to get consistent measurements, but big repos are probably not fully kept in memory like that.
Can't we have our 6/6 cake much easier and eat it too by learning a "fallback" mode, i.e. we try to invoke gzip, and if that doesn't work use the "internal" one?
Interesting idea, but I think the existing config option suffices. E.g. a distro could set it in the system-wide config file if/when gzip is installed.
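A packaging-time version of that choice might look roughly like the following. It is a sketch under assumptions: pick_tgz_command and write_tgz_config are hypothetical helper names, the config file path is up to the distro, and the "git archive gzip" value only means something once this series is applied:

```shell
#!/bin/sh
# pick_tgz_command (hypothetical helper): print the filter a packager might
# choose, preferring the external gzip when it is found on PATH.
pick_tgz_command() {
    if command -v gzip >/dev/null 2>&1; then
        echo 'gzip -cn'
    else
        # Magic value introduced by the patch under discussion.
        echo 'git archive gzip'
    fi
}

# write_tgz_config (hypothetical helper): store the choice in the given
# config file, e.g. the system-wide one that a package would ship.
write_tgz_config() {
    git config --file "$1" tar.tgz.command "$(pick_tgz_command)"
}

# Example (needs write access to the target file):
#   write_tgz_config /etc/gitconfig
```

The decision is made once, when the config file is written, rather than on every git archive run, which is the difference from a built-in fallback.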
Re the "eco mode": I also wonder how much of the overhead you're seeing for both that 17% and 2% would go away if you pin both processes to the same CPU, I can't recall the command offhand, but IIRC taskset or numactl can do that. I.e. is this really measuring IPC overhead, or I-CPU overhead on your system?
I'd expect that running git archive and gzip on the same CPU core takes more wall-clock time than using zlib because inflating the object files and deflating the archive are done sequentially in both scenarios. Can't test it on macOS because it doesn't offer a way to pin programs to a certain core, but e.g. someone with access to a Linux system can check that using taskset(1).

René
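On Linux, the pinned variant of the tgz benchmark could be set up along these lines. This is an untested sketch: pin_to_core is a hypothetical helper, core 0 is an arbitrary choice, and it relies on the gzip child inheriting the CPU affinity that taskset(1) sets for git:

```shell
#!/bin/sh
# Sketch only: assumes Linux with util-linux taskset(1) and hyperfine installed.
# pin_to_core is a hypothetical helper that prefixes a command line with
# taskset so the command, and any child it spawns, is restricted to one core.
pin_to_core() {
    core=$1; shift
    printf 'taskset -c %s %s\n' "$core" "$*"
}

# Same benchmark as in the thread, but with git and gzip sharing core 0.
bench=$(pin_to_core 0 './git -c tar.tgz.command="{command}" -C ../linux archive --format=tgz HEAD')

# Print the hyperfine invocation instead of running it; paste it into a shell
# in the git build directory to run the comparison.
echo "hyperfine -w3 -L command 'gzip -cn','git archive gzip' '$bench'"
```

If the expectation above holds, the pinned "gzip -cn" run should be slower than the pinned internal one.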