Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols
From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Date: 2021-05-14 08:21:33
Also in:
alsa-devel, dri-devel, intel-gfx, intel-wired-lan, keyrings, kvm, linux-acpi, linux-arm-kernel, linux-doc, linux-edac, linux-f2fs-devel, linux-hwmon, linux-iio, linux-input, linux-integrity, linux-media, linux-pci, linux-pm, linux-rdma, linux-usb, lkml, netdev, rcu
On Wed, 12 May 2021 18:07:04 +0100, David Woodhouse wrote:
> On Wed, 2021-05-12 at 14:50 +0200, Mauro Carvalho Chehab wrote:
> > Such conversion tools - plus some text editor like LibreOffice or similar - have a set of rules that turns some typed ASCII characters into UTF-8 alternatives, for instance converting quotes into curly quotes and adding non-breakable spaces. All of those are meant to produce better results when the text is displayed in HTML or PDF formats.
>
> And don't we render our documentation into HTML or PDF formats?
Yes.
> Are some of those non-breaking spaces not actually *useful* for their intended purpose?
No.
The thing is: non-breaking spaces can cause a lot of problems.
We even had to disable Sphinx's use of non-breaking spaces for
PDF output, as it was producing bad LaTeX/PDF output.
See commit 3b4c963243b1 ("docs: conf.py: adjust the LaTeX document output").
The aforementioned patch disables Sphinx's default behavior of
using NON-BREAKABLE SPACE in literal blocks and strings, via this
special setting: "parsedliteralwraps=true".
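As a rough illustration of where such a setting lives (this is not the
actual hunk from the commit; other keys are omitted, and passing the
option through 'sphinxsetup' is an assumption about how it is wired in),
a conf.py fragment could look like this:

```python
# Sketch of a Sphinx conf.py fragment; not copied from the kernel tree.
latex_elements = {
    # 'sphinxsetup' takes comma-separated key=value options
    # that are handed to the LaTeX builder.
    "sphinxsetup": "parsedliteralwraps=true",
}
```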
When NON-BREAKABLE SPACE was used in PDF output, several parts of
the media uAPI docs overran the document margins by far,
causing text to be truncated.
So, please **don't add NON-BREAKABLE SPACE**, unless you test
(and keep testing from time to time) that the output in all
formats properly supports it across different Sphinx versions.
-
Also, most of those came from conversion tools, together with other
eccentricities, like the use of the U+FEFF (BOM) character at the
start of some documents. The remaining ones seem to have come from
cut-and-paste.
For instance, bibliographic references (there are a couple of
those in media) sometimes have NON-BREAKABLE SPACE. I'm pretty
sure that those came from cut-and-pasting the document titles
from the original PDF documents or web pages that are
referenced.
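A minimal sketch of how such stray characters could be located (this is
not the script used for the series; the function name and the sample
text are made up for illustration):

```python
# Code points discussed in this thread.
SUSPECTS = {
    "\u00a0": "NO-BREAK SPACE",
    "\ufeff": "BOM",
}

def find_suspects(text):
    """Return (line, column, name) for each suspect character, 1-based."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in SUSPECTS:
                hits.append((lineno, col, SUSPECTS[ch]))
    return hits

# Made-up sample: a BOM at the start, and a pasted title with U+00A0.
sample = "\ufeffHeading\nSome\u00a0Title\n"
for hit in find_suspects(sample):
    print(hit)
```

Reporting line and column makes it easy to check each hit by hand
before deciding whether the character was intentional.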
> > While it is perfectly fine to use UTF-8 characters in Linux, and specially at the documentation, it is better to stick to the ASCII subset on such particular case, due to a couple of reasons:
> > 1. it makes life easier for tools like grep;
>
> Barely, as noted, because of things like line feeds.
You can use grep with "-z" to search for multi-line strings(*), like:
$ grep -Pzl 'grace period started,\s*then' $(find Documentation/ -type f)
Documentation/RCU/Design/Data-Structures/Data-Structures.rst
(*) Unfortunately, while "git grep" also has a "-z" flag, it
seems that this is (currently?) broken with regard to handling
multi-line matches:
$ git grep -Pzl 'grace period started,\s*then'
$
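The reason "-z" helps with plain grep is that the input is then split on
NUL bytes instead of newlines, so a whole text file becomes a single
record and "\s" is free to match the line break. The same idea, sketched
in Python on an in-memory string (the helper name and the sample text
are made up):

```python
import re

# Same pattern as in the grep example; \s also matches newlines.
PATTERN = re.compile(r"grace period started,\s*then")

def spans_lines(text):
    """Return True if the phrase occurs, even when split across a line break."""
    return PATTERN.search(text) is not None

# Made-up text where the phrase is wrapped onto two lines.
wrapped = "the grace period started,\nthen the callback runs\n"
print(spans_lines(wrapped))
```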
> > 2. they easier to edit with the some commonly used text/source code editors.
>
> That is nonsense. Any but the most broken and/or anachronistic environments and editors will be just fine.
Not really.
I do use a lot of UTF-8 here, as I type texts in Portuguese, but I rely
on the US-intl keyboard settings, which let me type "'a" to get á.
However, there's no shortcut for non-Latin UTF-8 characters, as far as
I know.
So, if I needed to type a curly quote in the text editors I normally
use for development (vim, nano, kate), I would need to cut-and-paste
it from somewhere[1].
[1] If I had a table of Unicode code points handy, I could type the
code point number manually... However, this seems to be currently
broken, at least on Fedora 33 (with Mate Desktop and US intl keyboard
with dead keys).
Here, <CTRL><SHIFT>U is not working. No idea why. I haven't
tested it for *years*, as I didn't see any reason why I would
need to type UTF-8 characters by number until we started
this thread.
In practice, on the very rare occasions when I need to write
non-Latin UTF-8 chars (maybe once a year or so, like when I
need a Greek letter or some weird symbol), chances are high
that I won't remember its code.
So, if I need to spend time searching for a specific symbol, after
finding it, I just cut-and-paste it.
But even in the best-case scenario, where I know the code and
<CTRL><SHIFT>U works, if I wanted to use, for instance, curly
quotes, the keystroke sequence would be:
<CTRL><SHIFT>U201csome string<CTRL><SHIFT>U201d
That's a lot harder, and has a higher chance of mistakenly
adding a wrong symbol, than just typing:
"some string"
Knowing that both will produce *exactly* the same output, why
should I bother doing it the hard way?
-
Now, I'm not arguing that you can't use whatever UTF-8 symbol you
want in your docs. I'm just saying that, now that the conversion
is over and a lot of documents ended up getting some UTF-8 characters
by accident, it is time for a cleanup.
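To make the idea of such a cleanup concrete, here is a small sketch
that maps a few of the characters mentioned in this thread back to
ASCII. The mapping is hypothetical and for illustration only; it is not
the list used by the patch series, and a real cleanup would have to
review each replacement in context:

```python
# Hypothetical mapping, for illustration only.
ASCII_EQUIVALENTS = str.maketrans({
    "\u201c": '"',  # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',  # RIGHT DOUBLE QUOTATION MARK
    "\u00a0": " ",  # NO-BREAK SPACE
    "\ufeff": "",   # BOM, dropped entirely
})

def to_ascii_subset(text):
    """Replace the mapped characters; everything else is left untouched."""
    return text.translate(ASCII_EQUIVALENTS)

print(to_ascii_subset("\u201csome string\u201d"))
```

Characters outside the table, such as á, pass through unchanged, which
matches the point that legitimate UTF-8 use is not the target.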
Thanks,
Mauro