Re: [PATCH] grep: skip UTF8 checks explicitally
From: Johannes Schindelin <hidden>
Date: 2019-07-23 12:47:36
Hi Carlo, On Mon, 22 Jul 2019, Carlo Arenas wrote:
On Mon, Jul 22, 2019 at 12:42 PM Ævar Arnfjörð Bjarmason [off-list ref] wrote:quoted
On Mon, Jul 22 2019, Johannes Schindelin wrote:quoted
So I am fine with this patch.I'm not, not because I dislike the approach. I haven't made up my mind yet.my bad, I should had explained better the regression I was trying to fix with this patch : $ git version git version 2.22.0.709.g102302147b.dirty $ git grep "Nguyễn Thái" .mailmap:Nguyễn Thái Ngọc Duy [off-list ref] fatal: pcre2_match failed with error code -22: UTF-8 error: isolated byte with 0x80 bit set $ git grep -I "Nguyễn Thái" .mailmap:Nguyễn Thái Ngọc Duy [off-list ref] po/TEAMS:Members: Nguyễn Thái Ngọc Duy <pclouds AT gmail.com> po/vi.po:# Nguyễn Thái Ngọc Duy [off-list ref], 2012. t/t1302-repo-version.sh:# Copyright (c) 2007 Nguyễn Thái Ngọc Duy t/t2104-update-index-skip-worktree.sh:# Copyright (c) 2008 Nguyễn Thái Ngọc Duy fatal: pcre2_match failed with error code -8: UTF-8 error: byte 2 top bits not 0x80
This is worrisome. The entire reason why we switch on Unicode mode is because it matters for regular expressions. Take for example this one: Nguyễ*n See how the ễ is supposed to be present an arbitrary number of times (including 0 times)? If PCRE2 was to interpret this as UTF-8, it would understand that the bytes 0xe1 0xbb 0x85 had to be repeated (or be missing), but if it interprets it _not_ as UTF-8, then it would handle _only the last_ byte, 0x85, that way, which will not work correctly. So when PCRE2 complains about the top two bits not being 0x80, it fails to parse the bytes correctly (byte 2 is 0xbb, whose two top bits are indeed 0x80). Maybe this is a bug in your PCRE2 version? Mine is 10.33... and this does not happen here... But then, I don't need the `-I` option, and my output looks like this: -- snip -- $ git grep --perl-regexp "Nguyễn Thái" .mailmap:Nguyễn Thái Ngọc Duy [off-list ref] po/TEAMS:Members: Nguyễn Thái Ngọc Duy <pclouds AT gmail.com> po/vi.po:# Nguyễn Thái Ngọc Duy [off-list ref], 2012. t/t1302-repo-version.sh:# Copyright (c) 2007 Nguyễn Thái Ngọc Duy t/t2104-update-index-skip-worktree.sh:# Copyright (c) 2008 Nguyễn Thái Ngọc Duy t/t7011-skip-worktree-reading.sh:# Copyright (c) 2008 Nguyễn Thái Ngọc Duy t/t7012-skip-worktree-writing.sh:# Copyright (c) 2008 Nguyễn Thái Ngọc Duy -- snap -- No error, you see? Or maybe it is a problem with the locale? Is your locale using an encoding different than UTF-8?
quoted
I stopped paying attention to this error-with-not-JIT discussion when I heard that some other series going into next for Windows fixed that issue[1] But now we have it again in some form? My ab/no-kwset has a lot of tests for encodings & locales combined with grep, don't some of those trigger this? If so we should make any such failure a test & part of this patch.I don't have a windows environment to test, and I admit I wasn't following clearly the whole conversation but at least in my case, I never got any test to fail, and I haven't seen any test with broken UTF-8.
The problem was not Windows-specific. It was specific to PCRE2 versions compiled without JIT support (which was the case in Git for Windows' SDK, but I fixed this on July 5th). Your patch adequately addresses this problem: instead of erroring out in the non-JIT version and passing in the JIT version, it will let also the non-JIT version pass. But still, I think the issue you found merits some deeper investigation, methinks.
quoted
As noted in [2] I'd be inclined to go the other way, if we indeed have some cases where PCRE skips its own checks does not dying actually give us anything useful? I'd think not, so just ignoring the issue seems like the wrong thing to do.we could reenable the checks by moving out of the JIT fast path in pcre so that pcre2_match is used for all matches (with JIT compiled as an optimization) and that might have the added benefit of solving the PCRE errors when JIT is broken[1] $ git version git version 2.22.0 $ git grep "Nguyễn Thái" .mailmap:Nguyễn Thái Ngọc Duy [off-list ref] po/TEAMS:Members: Nguyễn Thái Ngọc Duy <pclouds AT gmail.com> po/vi.po:# Nguyễn Thái Ngọc Duy [off-list ref], 2012. t/t1302-repo-version.sh:# Copyright (c) 2007 Nguyễn Thái Ngọc Duy t/t2104-update-index-skip-worktree.sh:# Copyright (c) 2008 Nguyễn Thái Ngọc Duy t/t7011-skip-worktree-reading.sh:# Copyright (c) 2008 Nguyễn Thái Ngọc Duy t/t7012-skip-worktree-writing.sh:# Copyright (c) 2008 Nguyễn Thái Ngọc Duy important to note that at least on my system (macOS 10.14.6) the output of all engines for grep is the same for a UTF-8 pattern match, even if using invalid UTF-8 in the corpus, and I would expect that to be a better indicative of correctness
Right. It is _inconsistent_ at the moment, and with your patch it would at least be consistent again. So I still would prefer your patch over "going the other way".
FWIW GNU grep (with -P) also ignores UTF-8 errors using the same flag and even PCRE seems to be going in the other direction with the recent addition of PCRE2_MATCH_INVALID_UTF[1]
Good point! Ciao, Dscho
$ od -b int.p 0000000 102 145 154 303 263 156 012 303 $ pcre2grep -U 'Beló' int.p Belón Carlo [1] https://public-inbox.org/git/20181209230024.43444-1-carenas@gmail.com/ [2] https://lists.exim.org/lurker/message/20190528.141422.2af1e386.gl.html