Thread: 87 messages, 19 authors, 2021-05-12

Re: [PATCH 00/53] Get rid of UTF-8 chars that can be mapped as ASCII

From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Date: 2021-05-11 09:37:44
Also in: alsa-devel, dri-devel, intel-gfx, intel-wired-lan, keyrings, kvm, linux-acpi, linux-arm-kernel, linux-edac, linux-ext4, linux-f2fs-devel, linux-fpga, linux-hwmon, linux-iio, linux-input, linux-integrity, linux-media, linux-pci, linux-pm, linux-rdma, linux-riscv, linux-usb, lkml, netdev, rcu

On Mon, 10 May 2021 15:22:02 -0400
"Theodore Ts'o" [off-list ref] wrote:
On Mon, May 10, 2021 at 02:49:44PM +0100, David Woodhouse wrote:
On Mon, 2021-05-10 at 13:55 +0200, Mauro Carvalho Chehab wrote:  
This patch series is doing conversion only when using ASCII makes
more sense than using UTF-8. 

See, a number of converted documents ended up with weird characters
like ZERO WIDTH NO-BREAK SPACE (U+FEFF). This specific character
doesn't do any good.

Others use NO-BREAK SPACE (U+00A0) instead of 0x20. Harmless, until
someone tries to use grep[1].
Replacing those makes sense. But replacing emdashes — which are a
distinct character that has no direct replacement in ASCII and which
people do *deliberately* use instead of hyphen-minus — does not.  
I regularly use --- for em-dashes and -- for en-dashes.  Markdown will
automatically translate 3 ASCII hyphens to em-dashes, and 2 ASCII
hyphens to en-dashes.  It's much, much easier for me to type 2 or 3
hyphens into my text editor of choice than trying to enter the UTF-8
characters. 
Yeah, typing those UTF-8 chars is a lot harder than typing -- and ---
in several text editors ;-)

Here, I only type UTF-8 chars for accents (my US-layout keyboards are
all set to US international, so typing those is easy).
If we can make sphinx do this translation, maybe that's
the best way of dealing with these two characters?
Sphinx already does that by default[1], using smartquotes:

	https://docutils.sourceforge.io/docs/user/smartquotes.html

These are the conversions it performs:

      - Straight quotes (" and ') turned into "curly" quote characters;
      - dashes (-- and ---) turned into en- and em-dash entities;
      - three consecutive dots (... or . . .) turned into an ellipsis char.

So, we can simply type straight single/double quotes, hyphens and dots
to get curly quotes, en/em dashes and ellipses.
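
To make the conversions listed above concrete, here is a rough sketch in
Python. It is not the docutils implementation, only a simplified
approximation; the function name smarten() is made up for illustration,
and real smartquotes handles many more corner cases (nested quotes,
markup, per-language quote styles):

```python
import re

def smarten(text: str) -> str:
    """Approximate the quote, dash and ellipsis conversions (illustrative only)."""
    # Dashes: replace the longer run first so "---" is not seen as "--" plus "-".
    text = text.replace("---", "\u2014")  # EM DASH
    text = text.replace("--", "\u2013")   # EN DASH

    # Ellipsis: three consecutive dots, with or without single spaces between them.
    text = re.sub(r"\.\.\.|\. \. \.", "\u2026", text)

    # Quotes: a quote at the start of the text, or right after whitespace or an
    # opening bracket, is treated as opening; every other quote as closing.
    text = re.sub(r'(?<![^\s(\[{])"', "\u201c", text)
    text = text.replace('"', "\u201d")
    text = re.sub(r"(?<![^\s(\[{])'", "\u2018", text)
    text = text.replace("'", "\u2019")
    return text


if __name__ == "__main__":
    print(smarten('"Hi" -- it\'s here --- done...'))
```

The point is just that the source stays plain ASCII while the rendered
output gets the typographic characters.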

[1] There's a way to disable it in conf.py, but in the kernel this is
    kept at its default: such conversions are done automatically.
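
    For reference, a sketch of the conf.py settings involved, assuming
    a reasonably recent Sphinx (the smartquotes option appeared in
    1.6.6). This is not copied from the kernel's conf.py; it only spells
    out what I understand the defaults to be:

```python
# conf.py (fragment, illustrative; these values are the Sphinx defaults)

# Convert straight quotes, dashes and dots at build time.
# Setting this to False is the way to disable the conversions.
smartquotes = True

# Which conversions to apply: q = quotes, D = "--" to en dash and
# "---" to em dash, e = "..." to ellipsis.
smartquotes_action = "qDe"
```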

Thanks,
Mauro