Re: Suspected git grep regression in git 2.40.0
From: Mathias Krause <hidden>
Date: 2023-03-21 19:21:19
Possibly related (same subject, not in this thread)
- 2023-03-21 · Re: Suspected git grep regression in git 2.40.0 · Bagas Sanjaya <hidden>
- 2023-03-21 · Suspected git grep regression in git 2.40.0 · Stephane Odul <hidden>
On 21.03.23 17:33, Junio C Hamano wrote:
[jc: on the CC: line, summoned a few people who may know a lot more about pcre than I do] Stephane Odul [off-list ref] writes:quoted
We have a CI pipeline on a private repository that started failing consistently while running `git grep -P` commands. The command fails with an exit code of -11 and is failing pretty consistently. With prior versions of git there is no issue whatsoever, but with 2.40.0 it always fails on the same call. Other git grep calls are fine, but that one is failing consistently. I was not able to reproduce locally but my main machine is an M1 MacBook Pro, the CI pipeline runs under Kubernetes in AWS and the container is based on Ubuntu 20.04 with the git client installed via the PPA. The error is for this pattern `git grep -cP '^\w+ = json.load'`. As a workaround we tried to download and install the microsoft-git v2.39.2 deb package since it allows us to downgrade, but then the git grep commands just got stuck.One "grep -P" related change we had recently between 2.39 and 2.40 was 50b6ad55 (grep: fall back to interpreter if JIT memory allocation fails, 2023-01-31) The code tries to figure out at runtime if pcre engine has functioning JIT by making an extra JIT compilation of a sample pattern and when it fails with a specific reason, fall back to interpreted pattern matching. The change made the code to emit a bit more detailed information when it fails, but a controlled exit from the codepath should give $?=128, not 11. So the above commit may or may not be related. It could be that the version of pcre library linked to Git 2.40 and older Git you are running in your CI environment has been updated.
I don't think it's that commit. I just tried the pattern here on a system that
would require the interpreter fallback and one that wouldn't and both didn't
error out (Debian bookworm/sid; libpcre2-8: 10.42-1).
Looking at the history, though, another commit stands out:
$ git log --oneline --no-merges v2.39.0..v2.40.0 -- grep.c
fb2ebe72a374 grep API: plug memory leaks by freeing "header_list"
891c9965fbc0 grep.c: refactor free_grep_patterns()
50b6ad55b04d grep: fall back to interpreter if JIT memory allocation fails
acabd2048ee0 grep: correctly identify utf-8 characters with \{b,w} in -P
Commit acabd2048ee0 suspiciously matches the pattern with its "\w" mention.
Looking at the diff gives:
if (!opt->ignore_locale && is_utf8_locale() && !literal)
- options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
+ options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);
So maybe the PCRE2_UCP option isn't supported by the system in question or it
forces PCRE2 to do a verification it didn't do before?
The exit code of -11 could relate to PCRE2's error PCRE2_ERROR_UTF8_ERR9 (-11)
strengthening the suspicion it's related to that change. It's described as:
PCRE2_ERROR_UTF8_ERR9 5th-byte's two top bits are not 0x80
Maybe the input data isn't valid UTF-8 encoded?
Stephane, can you pin point the file that the grep is failing on and verify it
is indeed a valid UFT-8 encoded file via:
$ iconv -f UTF-8 your_file > /dev/null && echo OK || echo "Not valid"
If this prints "OK", we need to investigate further. However, if it prints
"Not valid", a broken file encoding is probably the cause and we should
revisit the above change and think of alternatives, as this might be seen as
a regression (even though the file isn't strictly valid UTF-8).
Thanks,
Mathias
Does it make a difference if you disable JIT by prefixing the pattern with (*NO_JIT)? Thanks.