Thread (61 messages) 61 messages, 5 authors, 2018-02-16

Re: [PATCH v7 19/31] merge-recursive: add get_directory_renames()

From: Eric Sunshine <hidden>
Date: 2018-02-04 04:43:14

On Sat, Feb 3, 2018 at 9:04 PM, Elijah Newren [off-list ref] wrote:
On Sat, Feb 3, 2018 at 2:32 PM, Elijah Newren [off-list ref] wrote:
quoted
On Fri, Feb 2, 2018 at 5:02 PM, Stefan Beller [off-list ref] wrote:
quoted
On Tue, Jan 30, 2018 at 3:25 PM, Elijah Newren [off-list ref] wrote:
quoted
+       while (*--end_of_new == *--end_of_old &&
+              end_of_old != old_path &&
+              end_of_new != new_path)
+               ; /* Do nothing; all in the while loop */
Assuming many repos are UTF8 (including in their paths),
how does this work with display characters longer than one char?
It should be fine as we cut at the slash?
Can UTF-8 characters, other than '/', have a byte whose value matches
(unsigned char)('/')?  If so, then I'll need to figure out how to do
utf-8 character parsing.  Anyone have pointers?
Well, after digging around for a while, I found this claim on the
Wikipedia page for UTF-8:

  Since ASCII bytes do not occur when encoding non-ASCII code points
into UTF-8, UTF-8 is safe to use within most programming and document
languages that interpret certain ASCII characters in a special way,
such as "/" in filenames, "\" in escape sequences, and "%" in printf.

So, unless I'm reading something wrong here, I think that means this
code is just fine as it is.
You're reading it correctly. Unicode values greater than \U007f
encoded with UTF-8 will never contain bytes which can be confused with
any 7-bit ASCII character.

It's possible that Stefan was thinking of "combining characters"[1]
which may be "precomposed" and "decomposed"[2], but which appear the
same when rendered. For instance, "ö" might be a single Unicode
codepoint or two codepoints, such as "o" combined with a diaeresis.
It's a potential problem if you're comparing, byte by byte, two
filenames which look the same. However, Git takes pains[3] to avoid
this problem by ensuring (if possible) that filenames are precomposed
within Git even if they happen to be decomposed on the actual
filesystem. So, most likely, your code is okay as-is.

[1]: https://en.wikipedia.org/wiki/Combining_character
[2]: https://en.wikipedia.org/wiki/Diaeresis_(diacritic)
[3]: https://github.com/git/git/blob/master/compat/precompose_utf8.c
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help