Re: regex compilation error with --color-words
From: René Scharfe <hidden>
Date: 2023-03-31 20:45:25
On 30.03.23 at 09:55, Diomidis Spinellis wrote:
> On 30-Mar-23 1:55, Eric Sunshine wrote:
>> I'm encountering a failure on macOS High Sierra 10.13.6 when using
>> --color-words:
>
> The built-in word separation regular expression pattern for the Perl
> language fails to work with the macOS regex engine. The same also
> happens with the FreeBSD one (tested on 14.0). The issue can be
> replicated through the following sequence of commands.
>
> git init color-words
> cd color-words
> echo '*.pl diff=perl' >.gitattributes
> echo 'print 42;' >t.pl
> git add t.pl
> git commit -am Add
> git show --color-words
Or in Git's own repo:

   $ git log -p --color-words --no-merges '*.c'
   Schwerwiegend: invalid regular expression: [a-zA-Z_][a-zA-Z0-9_]*|[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*|0[xXbB][0-9a-fA-F]+[lLuU]*|\.[0-9][0-9]*([Ee][-+]?[0-9]+)?[fFlL]?|[-+*/<>%&^|=!]=|--|\+\+|<<=?|>>=?|&&|\|\||::|->\*?|\.\*|<=>|[^[:space:]]|[<C0>-<FF>][<80>-<BF>]+
   commit 14b9a044798ebb3858a1f1a1377309a3d6054ac8
   [...]

The error disappears when localization is turned off:

   $ LANG=C git log -p --color-words --no-merges '*.c' >/dev/null
   # just finishes without an error

The issue also vanishes when the "|[\xc0-\xff][\x80-\xbf]+" part, which
the macros PATTERNS and IPATTERN in userdiff.c append, is removed. So it
seems regcomp(3) on macOS doesn't like invalid Unicode characters unless
it's in ASCII mode (LANG=C).

664d44ee7f (userdiff: simplify word-diff safeguard, 2011-01-11) explains
that this part exists to match a multi-byte UTF-8 character. With a
regcomp(3) that supports multi-byte characters natively they need to be
specified differently, I guess, perhaps like this "[^\x00-\x7f]"?
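To make the difference between the two spellings concrete, here is a
minimal sketch in Python. It is only a stand-in: Python's re module is
not the POSIX regcomp(3) that Git calls, so it says nothing about how
macOS or FreeBSD actually behave. The names BYTE_SAFEGUARD,
CHAR_SAFEGUARD and is_valid_utf8 are made up for this illustration, and
the sample word is arbitrary.

```python
import re

# The safeguard appended in userdiff.c, as a bytes pattern: one lead
# byte (0xC0-0xFF) followed by one or more continuation bytes (0x80-0xBF).
BYTE_SAFEGUARD = rb"[\xc0-\xff][\x80-\xbf]+"

# The alternative floated above, for an engine that sees decoded
# characters instead of raw bytes (an assumption, not tested with regcomp).
CHAR_SAFEGUARD = r"[^\x00-\x7f]"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


text = "größe"
encoded = text.encode("utf-8")

# Byte-wise matching: each multi-byte character comes out as one token.
byte_tokens = re.findall(BYTE_SAFEGUARD, encoded)
print(byte_tokens)  # [b'\xc3\xb6', b'\xc3\x9f']

# Character-wise matching: each non-ASCII character comes out as one token.
char_tokens = re.findall(CHAR_SAFEGUARD, text)
print(char_tokens)  # ['ö', 'ß']

# The pattern itself, taken as raw bytes, is not valid UTF-8, which is
# presumably what a multi-byte-aware regex compiler trips over.
pattern_bytes = b"[\xc0-\xff][\x80-\xbf]+"
print(is_valid_utf8(pattern_bytes))  # False
print(is_valid_utf8(encoded))        # True
```

Both spellings pick out the same two characters here; the difference is
only whether the pattern is read as bytes or as characters.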
> Strangely, I haven't been able to reproduce the failure with egrep on
> either of the two platforms.
>
> egrep '[[:alpha:]_'\''][[:alnum:]_'\'']*|0[xb]?[0-9a-fA-F_]*|[0-9a-fA-F_]+(\.[0-9a-fA-F_]+)?([eE][-+]?[0-9_]+)?|=>|-[rwxoRWXOezsfdlpSugkbctTBMAC>]|~~|::|&&=|\|\|=|//=|\*\*=|&&|\|\||//|\+\+|--|\*\*|\.\.\.?|[-+*/%.^&<>=!|]=|=~|!~|<<|<>|<=>|>>|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+' /dev/null
No idea how to specify non-ASCII bytes in shell or regex. '\xNN' does
not seem to do the trick. printf(1) interprets octal numbers, though:

   $ echo ö | egrep $(printf "[\200-\377]")
   egrep: illegal byte sequence

(The regex contains "illegal bytes" -- UTF-8 multi-byte sequences cut
short; the "ö" is OK.)

René