Re: git bug report: 'git add' hangs in a large repo which has sparse-checkout file with large number of patterns in it
From: Elijah Newren <hidden>
Date: 2022-07-08 01:53:17
On Tue, Jul 5, 2022 at 6:08 AM Dian Xu [off-list ref] wrote:
Hi Elijah,
Hi Dian,
Please don't top-post on this list. It would also help to reply to the relevant email instead of picking a different email in the thread to put your answers in. Anyway, that aside...
Please see answers below:
1. H: 2.27m; S: 7.7k; Total: 2.28m
2. Sure I will run 'reapply' after the sparse-checkout file has changed. Just curious, do I have to run 'reapply' if 'checkout' is the next immediate cmd? I thought 'checkout' does the updating index as well
3. I simply added one file only, 'git add' and 'git add --sparse' still hang. Let me know if you need me to send you any debug info from pathspec.c/dir.c
4. Good to know and we are investigating if we have a way out from --no-cone
5. I should've been clearer: The experiment done here uses 2.37.0
Thanks for providing these details. It was enough to at least get me
started, and from my experiments, it appears the arguments to `git
add` are important. In particular, I could not trigger this when
passing actual filenames that existed. I could when passing a fake
filename. Here are the concrete steps I used to reproduce:
git clone git@github.com:newren/gvfs-like-git-bomb
cd gvfs-like-git-bomb
git init attempt
cd attempt
../make-a-git-bomb.sh
time git checkout bomb
echo "/*" >.git/info/sparse-checkout
echo '!/bomb/j/j/' >>.git/info/sparse-checkout
for i in $(seq 1 10000); do
    printf '!some/random/file/path-%05d\n' "$i"
done >>.git/info/sparse-checkout
git config core.sparseCheckout true
time git sparse-checkout reapply
echo hello >world
time git add --sparse world nonexistent
time git rm --cached --sparse world nonexistent
time git add world nonexistent
time git rm --cached world nonexistent
This sequence of steps will (1) clone a repo with 2 files, (2) create
another repository in subdirectory 'attempt' that has 1000001 files
(but only two unique files, and only six or so unique trees) in a
branch called 'bomb', (3) check it out, (4) create 10002 patterns for
the sparse-checkout file (only the first 2 of which match anything)
which will leave ~99% of files still present (990001 files checked out
and 10000 files sparse) and turn on sparsity, (5) measure how long it
takes to add and remove a file from the index, both with and without
the --sparse flag, but always listing an extra path that won't match
anything.
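As a sanity check on the pattern count in step (4), here is a minimal sketch that writes the same patterns to a scratch file instead of .git/info/sparse-checkout and counts them. The scratch file and the variable names are my own assumptions for illustration; they are not part of the reproduction above.

```shell
# Write the same patterns to a scratch file (assumption: a temp file,
# so a real .git/info/sparse-checkout is left untouched).
patterns=$(mktemp)
{
    echo "/*"
    echo '!/bomb/j/j/'
    for i in $(seq 1 10000); do
        printf '!some/random/file/path-%05d\n' "$i"
    done
} >"$patterns"

# Count all patterns, and separately those that start with '!'.
pattern_count=$(wc -l <"$patterns" | tr -d ' ')
negated_count=$(grep -c '^!' "$patterns")
echo "$pattern_count patterns, $negated_count negated"
```

Of those, only `/*` and `!/bomb/j/j/` match anything in the repository; the 10000 generated patterns are there purely to make the pattern list long.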
The timings I see for the setup steps are:
4m10.444s checkout bomb
1m0.380s sparse-checkout reapply
And the timings for the add/rm steps are:
4m43.353s add --sparse world nonexistent
9m25.666s add world nonexistent
0m0.129s rm --cached --sparse world nonexistent
9m23.601s rm --cached world nonexistent
which shows that 'rm' also has a performance problem without the
'--sparse' flag (which seems like another bug).
Now, if I remove the 'nonexistent' argument from the commands, then
the timings drop to:
0m0.236s add --sparse world
0m0.233s add world
0m0.175s rm --cached --sparse world
4m43.744s rm --cached world
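If anyone wants to collect comparable numbers on their own repository, a small wrapper along these lines may help. This is only a sketch: `time_cmd` is a hypothetical helper name, and it uses `date +%s`, so the resolution is whole seconds, which is coarse but enough to tell a sub-second run from a multi-minute one.

```shell
# time_cmd: run a command, discard its output, and report the elapsed
# wall-clock seconds together with the exit status.
# (Hypothetical helper; coarse one-second resolution via `date +%s`.)
time_cmd() {
    tc_start=$(date +%s)
    "$@" >/dev/null 2>&1
    tc_status=$?
    tc_end=$(date +%s)
    echo "$((tc_end - tc_start))s exit=$tc_status: $*"
}
```

For example, `time_cmd git add --sparse world nonexistent` followed by `time_cmd git add --sparse world` would show the two cases side by side. Printing the exit status matters here because the commands given a nonexistent path are expected to fail, and a fast failure should not be mistaken for a fast success.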
So, I can reproduce some slowness. 'rm' without --sparse seems
buggily slow with either set of arguments, whereas 'add' is only slow
when given a fake path. You never mentioned anything about the arguments you were
passing to `git add`, so I don't know whether you are using specific
filenames that just don't exist (like I did above), or globs that
perhaps match some files, or something else. That might be useful to
know. But there appears to be something here for both 'add' and 'rm'
that we could look into optimizing. I don't have time right now. I'm
not sure if someone else has some time to look into it; if no one else
does, I'll eventually try to get back to it.
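Since the arguments seem to matter, one rough way to report what kind of paths are being passed is sketched below. `classify_path` is a hypothetical helper, not something git provides; it relies on `git ls-files --error-unmatch`, which exits non-zero when the given path matches nothing in the index.

```shell
# classify_path: say whether an argument matches an index entry, exists
# only on disk, or matches nothing at all (like 'nonexistent' above).
# (Hypothetical helper; run it from inside the working tree.)
classify_path() {
    if git ls-files --error-unmatch -- "$1" >/dev/null 2>&1; then
        echo "tracked: $1"
    elif [ -e "$1" ]; then
        echo "untracked but on disk: $1"
    else
        echo "no match: $1"
    fi
}
```

Running `classify_path` on each argument given to `git add` should make it clear whether the slow case involves a path that matches nothing.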