Thread: 10 messages, 4 authors, 2019-07-24

Re: [PATCH] grep: skip UTF8 checks explicitally

From: Ævar Arnfjörð Bjarmason <hidden>
Date: 2019-07-24 18:22:54

On Wed, Jul 24 2019, Johannes Schindelin wrote:

> Hi Carlo,
>
> On Tue, 23 Jul 2019, Carlo Arenas wrote:
>
>> On Tue, Jul 23, 2019 at 5:47 AM Johannes Schindelin
>> [off-list ref] wrote:
>>
>>> So when PCRE2 complains about the top two bits not being 0x80, it fails
>>> to parse the bytes correctly (byte 2 is 0xbb, whose two top bits are
>>> indeed 0x80).
>>
>> the error is confusing but it is not coming from the pattern, but from
>> what PCRE2 calls the subject.
>>
>> meaning that while going through the repository it found content that
>> it tried to match but that it is not valid UTF-8, like all the png and
>> a few txt files that are not encoded as UTF-8
>> (ex: t/t3900/ISO8859-1.txt).
>>
>>> Maybe this is a bug in your PCRE2 version? Mine is 10.33... and this
>>> does not happen here... But then, I don't need the `-I` option, and my
>>> output looks like this:
>>
>> -I was just an attempt to workaround the obvious binary files (like
>> PNG); I'll assume you should be able to reproduce if using a non JIT
>> enabled PCRE2, regardless of version.
>>
>> my point was that unlike in your report, I didn't have any test cases
>> failing, because AFAIK there are no test cases using broken UTF-8 (the
>> ones with binary data are actually valid zero terminated UTF-8 strings)
>
> Thank you for this explanation. I think it makes a total lot of sense.
>
> So your motivation for this patch is actually a different one than mine,
> and I would like to think that this actually strengthens the case _in
> favor_ of it. The patch kind of kills two birds with one stone.

This patch is really the wrong thing to do. Don't get me wrong, I'm
sympathetic to the *problem* and it should be solved, but this isn't the
solution.

The PCRE2_NO_UTF_CHECK flag means "I have checked that this is a valid
UTF-8 string so you, PCRE, don't need to re-check it". To quote
pcre2api(3):

    If you know that your pattern is a valid UTF string, and you want to
    skip this check for performance reasons, you can set the
    PCRE2_NO_UTF_CHECK option. When it is set, the effect of passing an
    invalid UTF string as a pattern is undefined. It may cause your
    program to crash or loop.

(Later it's discussed that "pattern" here is also "subject string" in
the context of pcre2_{jit_,}match()).

I know almost nothing about the internals of PCRE's engine, but much of
it's based on Perl's, which I know way better. Doing the equivalent of
this in perl (setting the UTF8 flag on a SV) *will* cause asserts to
fail and possibly segfaults.

It's likely through dumb luck that this is "working". I.e. yes the JIT
mode is less anal about these checks, so if you say grep for "Nguyễn
Thái" in UTF-8 mode and there's binary data you're satisfied not to find
anything in that binary data.

But if you are, I'm willing to bet this ruins your day, e.g. PCRE would
"skip ahead" a 4-byte character because it sees a telltale
U+10000 through U+10FFFF start sequence, except that wasn't a character,
it was some arbitrary binary.
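
To make the "skip ahead" failure concrete, here is a toy sketch in plain C.
It is not PCRE2 code and says nothing about how PCRE2's engine is actually
written; find_ascii_trusting() is a hypothetical scanner that, like any code
told "this input is valid UTF-8", trusts the lead byte and never inspects
the continuation bytes.

```c
#include <stddef.h>
#include <string.h>

/*
 * Sequence length implied by a UTF-8 lead byte, or 0 if the byte cannot
 * start a sequence. 0xf0..0xf4 announce the 4-byte U+10000..U+10FFFF range.
 */
static size_t utf8_len_from_lead(unsigned char b)
{
	if (b < 0x80)
		return 1;
	if (b >= 0xc2 && b <= 0xdf)
		return 2;
	if (b >= 0xe0 && b <= 0xef)
		return 3;
	if (b >= 0xf0 && b <= 0xf4)
		return 4;
	return 0;
}

/*
 * Hypothetical scanner that trusts the caller's "this is valid UTF-8"
 * promise: it steps one character at a time and never looks at the
 * continuation bytes. Returns the offset of `want`, or -1.
 */
static long find_ascii_trusting(const unsigned char *s, size_t n,
				unsigned char want)
{
	size_t i = 0;

	while (i < n) {
		size_t len;

		if (s[i] == want)
			return (long)i;
		len = utf8_len_from_lead(s[i]);
		i += len ? len : 1; /* step over the whole "character" */
	}
	return -1;
}

/* Plain byte-wise search, for comparison. */
static long find_ascii_bytewise(const unsigned char *s, size_t n,
				unsigned char want)
{
	const unsigned char *p = memchr(s, want, n);

	return p ? (long)(p - s) : -1;
}
```

Given the four bytes f0 66 6f 6f (a stray 0xf0 followed by "foo"), the
trusting scanner treats 0xf0 as the start of a 4-byte character, jumps over
"foo" and reports no match for 'f', while the byte-wise search finds it at
offset 1. A silently missed match is the mild outcome; the documentation
quoted above allows for crashes and loops.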

Now, what is the solution? I don't have any patches yet, but things I
intend to look at:

 1) We're oversupplying PCRE2_UTF now, and one such case is what's being
    reported here. I.e. there's no reason I can think of for why a
    fixed-string pattern should need PCRE2_UTF set when not combined
    with --ignore-case. We can just not do that, but maybe I'm missing
    something there.

 2) We can do "try utf8, and fallback". A more advanced version of this
    is what the new PCRE2_MATCH_INVALID_UTF flag (mentioned upthread)
    does. I was thinking something closer to just carrying two compiled
    patterns, and falling back on the ~PCRE2_UTF one if we get a
    PCRE2_ERROR_UTF8_* error.
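
The control flow for 2) could look roughly like the sketch below. It is
deliberately library-free so it can be compiled on its own: match_utf() and
match_raw() are hypothetical stand-ins for matching against two compiled
patterns (one compiled with PCRE2_UTF, one without), and TOY_BAD_UTF8 stands
in for a PCRE2_ERROR_UTF8_* return value. The validator is simplified and
only checks lead bytes and continuation bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes, not PCRE2's actual values. */
enum { TOY_NOMATCH = -1, TOY_BAD_UTF8 = -3 };

/*
 * Simplified UTF-8 check: valid lead byte, enough bytes left, and every
 * continuation byte has its top two bits set to 10. A full validator
 * would also reject overlong forms and surrogates.
 */
static int utf8_is_valid(const unsigned char *s, size_t n)
{
	size_t i = 0;

	while (i < n) {
		unsigned char b = s[i];
		size_t k, len = b < 0x80 ? 1 :
			(b >= 0xc2 && b <= 0xdf) ? 2 :
			(b >= 0xe0 && b <= 0xef) ? 3 :
			(b >= 0xf0 && b <= 0xf4) ? 4 : 0;

		if (!len || len > n - i)
			return 0;
		for (k = 1; k < len; k++)
			if ((s[i + k] & 0xc0) != 0x80)
				return 0;
		i += len;
	}
	return 1;
}

/* Byte-wise substring search; returns the offset or TOY_NOMATCH. */
static long find_bytes(const unsigned char *hay, size_t n, const char *needle)
{
	size_t i, m = strlen(needle);

	for (i = 0; m <= n && i + m <= n; i++)
		if (!memcmp(hay + i, needle, m))
			return (long)i;
	return TOY_NOMATCH;
}

/* Stand-in for the UTF pattern: refuses subjects that are not valid UTF-8. */
static long match_utf(const unsigned char *s, size_t n, const char *needle)
{
	if (!utf8_is_valid(s, n))
		return TOY_BAD_UTF8;
	return find_bytes(s, n, needle);
}

/* Stand-in for the non-UTF pattern: treats the subject as raw bytes. */
static long match_raw(const unsigned char *s, size_t n, const char *needle)
{
	return find_bytes(s, n, needle);
}

/* Try the UTF matcher first; fall back only on a UTF-8 validity error. */
static long match_with_fallback(const unsigned char *s, size_t n,
				const char *needle)
{
	long rc = match_utf(s, n, needle);

	if (rc == TOY_BAD_UTF8)
		rc = match_raw(s, n, needle);
	return rc;
}
```

The point of the shape is that the validity check stays on for the UTF
attempt, and the fallback is only taken on a validity error, never on a
plain no-match. An ISO-8859-1 subject such as "caf\xe9 foo" fails the UTF
attempt and is then searched as bytes.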

One reason we can't "just" go back to the pre-ab/no-kwset behavior is
that one thing it does is fix a long-standing bug where we'd do the
wrong thing under locales && -i && UTF-8 string/pattern. More precisely
we'd punt it to the C library's matching function, which would probably
do the wrong thing.