Re: [PATCH v4 4/6] archive-tar: add internal gzip implementation

From: René Scharfe <hidden>
Date: 2022-06-24 20:25:27

Am 24.06.22 um 13:13 schrieb Ævar Arnfjörð Bjarmason:

On Thu, Jun 16 2022, René Scharfe wrote:

quoted

Am 15.06.22 um 22:32 schrieb Ævar Arnfjörð Bjarmason:

quoted

[...]

Understandable, and you can set tar.tgz.command='gzip -cn' to get the
old behavior.  Saving energy is a better default, though.

I disagree with that in general, a big reason for why git won out over
other VCS's is that it wasn't as slow. I think we should primarily be
interested in the time a user might end up staring at the screen.

I understand the concern to have "git archive" just work, e.g. if you
uninstall gzip(1) (although that seems rather obscure, but perhaps this
is for more minimal setups).

The previous attempt came from/via Git on Windows.

I don't think saving energy is a virtue, *maybe* it is, but maybe your
computer is powered by hydro, solar or nuclear instead of coal, so even
if we're taking global energy policy into account for changes to git
it's highly context dependant.

Or a device runs on battery power and saving energy keeps it running a
bit longer.  Or it's housed in a data center and saving energy helps
reduce cooling requirements.

In any case, this is also true for pretty much any other git command
that might spawn processes or threads, e.g. "git grep":

	$ hyperfine -w3 -L cpus 0,0-7 'taskset --cpu-list {cpus} ./git grep foo.*bar' -r 10
	Benchmark 1: taskset --cpu-list 0 ./git grep foo.*bar
	  Time (mean ± σ):      39.3 ms ±   1.2 ms    [User: 20.0 ms, System: 18.6 ms]
	  Range (min … max):    38.2 ms …  41.8 ms    10 runs

	Benchmark 2: taskset --cpu-list 0-7 ./git grep foo.*bar
	  Time (mean ± σ):      28.1 ms ±   1.3 ms    [User: 43.5 ms, System: 51.0 ms]
	  Range (min … max):    26.6 ms …  31.2 ms    10 runs

	Summary
	  'taskset --cpu-list 0-7 ./git grep foo.*bar' ran
	    1.40 ± 0.08 times faster than 'taskset --cpu-list 0 ./git grep foo.*bar'

Here we use less than 1/2 the user/system time when I pin it to 1 cpu,
but we're 40% slower.

So this is a bit of a digression, but this particular thing seems much
better left to the OS or your hardware's CPU throttling policy. To the
extent that we care perhaps more fitting would be to have a global
core.wrapper-cmd option or something, so you could pass all git commands
through "taskset" (or your local equivalent), or just use shell aliases.

Not sure what conclusion to draw from these numbers.  Perhaps that
computation is not the bottleneck here (increasing the number of cores by
700% increases speed only by 40%)?  That coordination overhead makes up a
big percentage and there might be room for improvement/tuning?

In any case, I agree we should leave scheduling decisions at runtime to
the OS.

quoted

The runtime in the real world probably includes lots more I/O time.  The
tests above are repeated and warmed up to get consistent measurements,
but big repos are probably not fully kept in memory like that.

On top of that I guess only few people create tgz files at all.  Most of
them I would expect to be created automatically (and cached) by sites
like kernel.org.  So I imagine people rather create tar.xz, tar.zst or
zip archives these days.  Or use git at both ends (push/pull), as they
should. ;-)  I have no data to support this guess, though.

But yeah, the tradeoff sounds a bit weird: Give 17% duration, get 2% CPU
time back -- sounds like a ripoff.  In your example below it's 12%
longer duration for 5% saved CPU time, which sounds a bit better, but
still not terribly attractive.

Look at it from a different angle: This basic sequential implementation
is better for non-interactive tgz creation due to its slightly lower
CPU usage, which we cannot achieve with any parallel process setup.
It's easier to deploy because it doesn't need gzip.  Its runtime hit
isn't *that* hard, and people interested primarily in speed should
parallelize the expensive part, deflate, not run the cheap tar creation
parallel to a single-threaded deflate.  I.e. they should already run
pigz (https://zlib.net/pigz/).

$ hyperfine -L gz gzip,pigz -w3 'git -C ../linux archive --format=tar HEAD | {gz} -cn'
Benchmark 1: git -C ../linux archive --format=tar HEAD | gzip -cn
  Time (mean ± σ):     20.764 s ±  0.007 s    [User: 24.119 s, System: 0.606 s]
  Range (min … max):   20.758 s … 20.781 s    10 runs

Benchmark 2: git -C ../linux archive --format=tar HEAD | pigz -cn
  Time (mean ± σ):      6.077 s ±  0.023 s    [User: 29.708 s, System: 1.599 s]
  Range (min … max):    6.037 s …  6.125 s    10 runs

Summary
  'git -C ../linux archive --format=tar HEAD | pigz -cn' ran
    3.42 ± 0.01 times faster than 'git -C ../linux archive --format=tar HEAD | gzip -cn'

quoted

Can't we have our 6/6 cake much easier and eat it too by learning a
"fallback" mode, i.e. we try to invoke gzip, and if that doesn't work
use the "internal" one?

Interesting idea, but I think the existing config option suffices.  E.g.
a distro could set it in the system-wide config file if/when gzip is
installed.

I think in practice distros are unlikely to have such triggers for
"package X is installed, let's set config Y". I mean, e.g. Debian can do
that with its packaging system, but it's expecting a lot.

I don't *expect* any reaction either way, but packagers *can* go with a
custom config if they see the need.

Why not flip
the default depending on if start_command() fails?

Because it's harder to test and support due to its more complicated
behavior, and I don't see why it would be needed.

quoted

Re the "eco mode": I also wonder how much of the overhead you're seeing
for both that 17% and 2% would go away if you pin both processes to the
same CPU, I can't recall the command offhand, but IIRC taskset or
numactl can do that. I.e. is this really measuring IPC overhead, or
I-CPU overhead on your system?

I'd expect that running git archive and gzip at the same CPU core takes
more wall-clock time than using zlib because inflating the object files
and deflating the archive are done sequentially in both scenarios.
Can't test it on macOS because it doesn't offer a way to pin programs to
a certain core, but e.g. someone with access to a Linux system can check
that using taskset(1).

Here's a benchmark, this is your hyperfine command, just with taskset
added. It's an 8-core box, so 0-7 is "all CPUs" (I think...):

	hyperfine -w3 \
		-L cpus 0,0-7 \
		-L command 'gzip -cn','git archive gzip' \
		'taskset --cpu-list {cpus} ./git -c tar.tgz.command="{command}" archive --format=tgz HEAD'

Which gives me:

	Benchmark 1: taskset --cpu-list 0 ./git -c tar.tgz.command="gzip -cn" archive --format=tgz HEAD
	  Time (mean ± σ):      1.561 s ±  0.029 s    [User: 1.503 s, System: 0.058 s]
	  Range (min … max):    1.522 s …  1.622 s    10 runs

	Benchmark 2: taskset --cpu-list 0-7 ./git -c tar.tgz.command="gzip -cn" archive --format=tgz HEAD
	  Time (mean ± σ):      1.337 s ±  0.029 s    [User: 1.535 s, System: 0.075 s]
	  Range (min … max):    1.298 s …  1.388 s    10 runs

	Benchmark 3: taskset --cpu-list 0 ./git -c tar.tgz.command="git archive gzip" archive --format=tgz HEAD
	  Time (mean ± σ):      1.493 s ±  0.032 s    [User: 1.453 s, System: 0.040 s]
	  Range (min … max):    1.462 s …  1.572 s    10 runs

	Benchmark 4: taskset --cpu-list 0-7 ./git -c tar.tgz.command="git archive gzip" archive --format=tgz HEAD
	  Time (mean ± σ):      1.503 s ±  0.026 s    [User: 1.466 s, System: 0.036 s]
	  Range (min … max):    1.469 s …  1.542 s    10 runs

	Summary
	  'taskset --cpu-list 0-7 ./git -c tar.tgz.command="gzip -cn" archive --format=tgz HEAD' ran
	    1.12 ± 0.03 times faster than 'taskset --cpu-list 0 ./git -c tar.tgz.command="git archive gzip" archive --format=tgz HEAD'
	    1.12 ± 0.03 times faster than 'taskset --cpu-list 0-7 ./git -c tar.tgz.command="git archive gzip" archive --format=tgz HEAD'
	    1.17 ± 0.03 times faster than 'taskset --cpu-list 0 ./git -c tar.tgz.command="gzip -cn" archive --format=tgz HEAD'

Whic I think should control for the IPC overhead v.s. the advantage of
multicore. I.e. we're faster with "gzip -cn" on multicore, but the
internal implementation has an advantage when it comes to

Right, #1, #3 and #4 all run sequentially, but #1 has the pipe overhead
to deal with as well, which adds 5 percentage points to its runtime.

René

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help