Thread: 4 messages, 3 authors, 2023-04-02

Re: regex compilation error with --color-words

From: René Scharfe <hidden>
Date: 2023-03-31 20:45:25

On 30.03.23 at 09:55, Diomidis Spinellis wrote:
On 30-Mar-23 1:55, Eric Sunshine wrote:
I'm encountering a failure on macOS High Sierra 10.13.6 when using
--color-words:
The built-in word separation regular expression pattern for the Perl language fails to work with the macOS regex engine.  The same also happens with the FreeBSD one (tested on 14.0).

The issue can be replicated through the following sequence of commands.

git init color-words
cd color-words
echo '*.pl   diff=perl' >.gitattributes
echo 'print 42;' >t.pl
git add t.pl
git commit -am Add
git show --color-words
Or in Git's own repo:

   $ git log -p --color-words --no-merges '*.c'
   Schwerwiegend: invalid regular expression: [a-zA-Z_][a-zA-Z0-9_]*|[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*|0[xXbB][0-9a-fA-F]+[lLuU]*|\.[0-9][0-9]*([Ee][-+]?[0-9]+)?[fFlL]?|[-+*/<>%&^|=!]=|--|\+\+|<<=?|>>=?|&&|\|\||::|->\*?|\.\*|<=>|[^[:space:]]|[<C0>-<FF>][<80>-<BF>]+
   commit 14b9a044798ebb3858a1f1a1377309a3d6054ac8
   [...]

The error disappears when localization is turned off:

   $ LANG=C git log -p --color-words --no-merges '*.c' >/dev/null
   # just finishes without an error

The issue also vanishes when the "|[\xc0-\xff][\x80-\xbf]+" part that
the macros PATTERNS and IPATTERN in userdiff.c append is removed.

So it seems regcomp(3) on macOS doesn't like invalid Unicode characters
unless it's in ASCII mode (LANG=C).  664d44ee7f (userdiff: simplify
word-diff safeguard, 2011-01-11) explains that this part exists to match
a multi-byte UTF-8 character.  With a regcomp(3) that supports
multi-byte characters natively they need to be specified differently, I
guess, perhaps like this "[^\x00-\x7f]"?
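
To check this outside of Git, something like the following could be used.
It is only a sketch: compile_in_locale(), the result codes and the two
macros are names made up for this mail, and setting just LC_CTYPE is an
assumption about which locale category matters.  The second pattern starts
its range at \001 instead of \x00 because a NUL byte would end the C
string handed to regcomp(3).

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* The byte-range suffix discussed above, written as raw bytes. */
#define WORD_SUFFIX "[\300-\377][\200-\277]+"
/* Possible alternative; the range starts at \001, not \x00 (see above). */
#define NON_ASCII "[^\001-\177]"

/* Hypothetical result codes for this sketch. */
enum { COMPILE_OK = 0, COMPILE_FAILED = 1, LOCALE_UNAVAILABLE = 2 };

/*
 * Switch LC_CTYPE to the given locale and try to compile the pattern as
 * an extended regular expression.  Note that this changes the locale of
 * the whole process.
 */
int compile_in_locale(const char *locale, const char *pattern)
{
	regex_t re;

	assert(locale != NULL && pattern != NULL);
	if (setlocale(LC_CTYPE, locale) == NULL)
		return LOCALE_UNAVAILABLE;
	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return COMPILE_FAILED;
	regfree(&re);
	return COMPILE_OK;
}
```

Calling compile_in_locale("C", WORD_SUFFIX) and then again with a UTF-8
locale name that exists on the system (the names differ between
platforms) should show whether the platform's regcomp(3) rejects the
suffix only in multi-byte mode.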
Strangely, I haven't been able to reproduce the failure with egrep on either of the two platforms.

egrep '[[:alpha:]_'\''][[:alnum:]_'\'']*|0[xb]?[0-9a-fA-F_]*|[0-9a-fA-F_]+(\.[0-9a-fA-F_]+)?([eE][-+]?[0-9_]+)?|=>|-[rwxoRWXOezsfdlpSugkbctTBMAC>]|~~|::|&&=|\|\|=|//=|\*\*=|&&|\|\||//|\+\+|--|\*\*|\.\.\.?|[-+*/%.^&<>=!|]=|=~|!~|<<|<>|<=>|>>|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+' /dev/null
No idea how to specify non-ASCII bytes in shell or regex.  '\xNN' does
not seem to do the trick.  printf(1) interprets octal numbers, though:

   $ echo ö | egrep $(printf "[\200-\377]")
   egrep: illegal byte sequence

(The regex contains "illegal bytes" -- UTF-8 multi-byte sequences cut
short; the "ö" is OK.)
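
To illustrate what "illegal bytes" means here, below is a simplified
structural check.  It is an assumption-laden sketch, not what the macOS
regex code does: utf8_trailing_bytes() and utf8_is_well_formed() are
made-up names, and the check ignores overlong three- and four-byte forms
and surrogates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Number of continuation bytes a lead byte announces, or -1 if the byte
 * cannot start a UTF-8 sequence (0x80-0xC1 and 0xF5-0xFF).
 */
int utf8_trailing_bytes(unsigned char lead)
{
	if (lead < 0x80)
		return 0;	/* ASCII */
	if (lead >= 0xC2 && lead <= 0xDF)
		return 1;
	if (lead >= 0xE0 && lead <= 0xEF)
		return 2;
	if (lead >= 0xF0 && lead <= 0xF4)
		return 3;
	return -1;
}

/* Simplified check: every lead byte has all its continuation bytes. */
bool utf8_is_well_formed(const char *s, size_t len)
{
	size_t i = 0;

	assert(s != NULL || len == 0);
	while (i < len) {
		int n = utf8_trailing_bytes((unsigned char)s[i]);

		/* Bad lead byte, or sequence cut short by end of input. */
		if (n < 0 || (size_t)n > len - i - 1)
			return false;
		for (int k = 1; k <= n; k++)
			if (((unsigned char)s[i + k] & 0xC0) != 0x80)
				return false;
		i += (size_t)n + 1;
	}
	return true;
}
```

With this check the two bytes of "ö" (\303\266) pass, while the pattern
bytes "[\200-\377]" fail at \200, a continuation byte without a lead
byte.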

René