Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese

From: Jeff King <hidden>
Date: 2022-11-29 18:23:38

On Tue, Nov 29, 2022 at 08:32:58PM +0900, Junio C Hamano wrote:

Ævar Arnfjörð Bjarmason [off-list ref] writes:


or (if chinese can not be displayed correctly)

-  <E4><B8><BA>1
+  <E4><B8><BA>2

Actual result of "git diff --color-words"

<E4><B8>[-<BA>1-]{+<BA>2+}
...

I think we could provide new ways to do per-language diffs, right now
you can use --word-diff-regex, but it would be handy to e.g. have a
built-in collection of those (or other non-regex boundary algorithms)
for Chinese etc.

I think you are thinking about it with unnecessary complexity.

The only thing that needs noticing in the above example, I think, is
that the three-byte sequence E4-B8-BA in the example is supposed to
be a single unicode character, and the actual result depicted can
happen only if we (incorrectly) chomp that single character in the
middle.

No matter what language we are using, we shouldn't do that.

I suspect that the "--word-diff" internals are not even aware of what a
character is, but if you assume UTF-8 (precomposed), then you should
be able to tell where the character boundary is by only looking at
the high-bit patterns to avoid producing such an output.
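
To illustrate the high-bit check described above, here is a minimal sketch. It
is not git's code; the helper names are hypothetical and the input is assumed
to be valid UTF-8. Every continuation byte in UTF-8 has the bit pattern
10xxxxxx, so an offset is a character boundary when the byte at that offset
is not a continuation byte.

```python
def is_utf8_boundary(data, i):
    """Return True if offset i does not fall inside a multi-byte character.

    Hypothetical helper: assumes `data` holds valid UTF-8.
    """
    if i <= 0 or i >= len(data):
        return True
    # Continuation bytes look like 10xxxxxx; anything else starts a character.
    return (data[i] & 0xC0) != 0x80


def snap_to_boundary(data, i):
    """Move offset i backwards until it sits on a character boundary."""
    while not is_utf8_boundary(data, i):
        i -= 1
    return i


line = b"\xe4\xb8\xba1"  # the bytes E4 B8 BA followed by ASCII "1"

# Offset 2 is the split between B8 and BA from the report.
print(is_utf8_boundary(line, 2))
print(snap_to_boundary(line, 2))
```

With the bytes from the report, offset 2 (between B8 and BA) is rejected and
snaps back to offset 0, the start of the three-byte character.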

Agreed that we should probably avoid breaking characters. But what
puzzles me more is that we break it between B8 and BA, and not
elsewhere. Why not between E4 and B8? Why not between BA and "1"?

If the rule is "break on ascii whitespace", then I'd have expected the
whole four-byte sequence to be taken as a unit. In other words, it
should not have to care what a character is, as long as the bytes
for space characters cannot appear inside other characters (which is
true of UTF-8).
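
A small sketch of that expectation, with the caveat that this is not how git's
word-diff code is written; the function name and the whitespace set are
assumptions made for illustration. Because every byte of a multi-byte UTF-8
character is 0x80 or higher, a splitter that only looks for ASCII whitespace
bytes can never cut such a character in half.

```python
# ASCII whitespace bytes (assumed set: space, tab, LF, CR, VT, FF).
ASCII_SPACE = b" \t\n\r\x0b\x0c"


def split_words(data):
    """Split a byte string into runs of non-whitespace bytes.

    Hypothetical helper: it never inspects character boundaries, only
    whether each byte is one of the ASCII whitespace bytes.
    """
    words = []
    current = bytearray()
    for byte in data:
        if byte in ASCII_SPACE:
            if current:
                words.append(bytes(current))
                current.clear()
        else:
            current.append(byte)
    if current:
        words.append(bytes(current))
    return words


old = b"  \xe4\xb8\xba1\n"
new = b"  \xe4\xb8\xba2\n"

print(split_words(old))
print(split_words(new))
```

Under this rule each line yields a single four-byte word, so a word diff built
on it would report the whole sequence as changed rather than splitting between
B8 and BA.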

-Peff
