Re: b4: unicode control characters -- warn or remove?

From: Konstantin Ryabitsev <hidden>
Date: 2021-11-01 21:03:10
Also in: tools

On Mon, Nov 01, 2021 at 09:49:14PM +0100, Pavel Machek wrote:

quoted

I think the following would be a sane check:

1. are there unicode control characters (CCs) present?
2. are there other characters from RTL languages present in the same line?

if both 1 && 2 are true, this is a legitimate use of Unicode CCs. If only 1 is
true, then it's likely worth a warning.

Maybe even relax #2 to just check for unicode characters above a certain
barrier where RTL languages live. I think everyone will agree that if there
are unicode CCs and no other unicode characters in that same line, it's likely
not a legitimate use of control characters.

If you are worried about malicious patches, then it should be easy for
attackers to add some RTL characters and escape the check...

Well, the point of this attack was to trick the reviewer into accepting code
that the compiler would treat differently (e.g. something that looked to be
inside a comment block is actually outside of it).

So, if attackers include some actual RTL text, then the reviewer would no
longer be (as easily) tricked because there would be stuff other than just
invisible characters in the line of code.

This actually similar to how we treat unicode domains. Most browsers only
allow unicode domains when the entire domain name consists of unicode
characters. I suggest we take a similar approach.

-K

Attachments

signature.asc [application/pgp-signature] 228 bytes

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help