Thread (41 messages), 5 authors, 2021-05-17

Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)

From: Mauro Carvalho Chehab <mchehab@kernel.org>
Date: 2021-05-07 08:04:41

On Fri, 7 May 2021 08:39:24 +0200,
Mauro Carvalho Chehab [off-list ref] wrote:
On Thu, 6 May 2021 14:21:01 -0700,
Randy Dunlap [off-list ref] wrote:
On 5/6/21 11:08 AM, Matthew Wilcox wrote:  
On Thu, May 06, 2021 at 10:57:53AM -0700, Randy Dunlap wrote:    
I have been going thru some of the Documentation/ files...

Why do several of the files begin with
(hex) ef bb bf    followed by "=================="
for a heading, instead of just "===================".
See e.g. Documentation/timers/no_hz.rst.    
No idea! It seems that the text editor I used at that time added
it for whatever reason.

00000000  ef bb bf 3d 3d 3d 3d 3d  3d 3d 3d 3d 3d 3d 3d 3d  |...=============|

ef bb bf is utf8 for 0b1111'111011'111111 = 0xFEFF which is the
https://en.wikipedia.org/wiki/Byte_order_mark

We should delete it.
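
The byte arithmetic above can be checked, and a leading BOM dropped, with a few lines of Python. This is only a sketch: `strip_bom` is a hypothetical helper name, not something that exists in the kernel tree.

```python
import codecs

# EF BB BF is the UTF-8 encoding of U+FEFF (the byte order mark).
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"
assert codecs.BOM_UTF8.decode("utf-8") == "\ufeff"

# Payload bits of a 3-byte UTF-8 sequence: 1110xxxx 10xxxxxx 10xxxxxx.
b0, b1, b2 = codecs.BOM_UTF8
code_point = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)
assert code_point == 0xFEFF


def strip_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

Working on raw bytes keeps the rest of the file untouched, whatever else it contains.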
    
OK, thanks, I have started on that.


Just another question: ("inquiring minds want to know")

Why is/are some docs using U+2217 '∗' instead of ASCII '*'?
E.g., Documentation/cdrom/cdrom-standard.rst.
The cdrom doc is a very special case: it was originally written in LaTeX.
I don't remember any other document in LaTeX inside the Kernel docs during
the conversions I made. See:
	e327cfcb2542 ("docs: cdrom-standard.tex: convert from LaTeX to ReST")

In order to convert it to .rst, I used some tool to first turn it
into plain text (probably LaTeX, but I don't remember anymore), and then
I manually reviewed the entire file, adding ReST tags where needed.

I didn't realize that utf-8 chars were used instead of normal ASCII chars,
as both appear the same when editing it[1].

[1] I use Fedora here. Fedora changed the default charset to utf-8 a long
    time ago.

Anyway, we should be able to get rid of the weird UTF-8 chars in it with:

	$ iconv -f utf-8 -t ascii//TRANSLIT Documentation/cdrom/cdrom-standard.rst

I'll prepare a patch fixing it. Some care should be taken, however, as
it has two places where UTF-8 chars should be used[2].

[2] There are two German person names that use UTF-8 chars:
    - 'o' + umlaut;
    - a LATIN SMALL LETTER SHARP S (Eszett)
Btw, I did a quick check here: excluding translations, there are 182
files with UTF-8 chars at next-20210429. It seems that most of them
are in files that got converted from DocBook and HTML.
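
A check of this kind could be sketched in Python as below. This is not the check that was actually run for the numbers above; `non_ascii_census` is a hypothetical name, and the sketch only counts non-ASCII characters in a string that has already been decoded as UTF-8.

```python
import unicodedata
from collections import Counter


def non_ascii_census(text: str) -> Counter:
    """Count each non-ASCII character in text, keyed by its Unicode name."""
    counts = Counter()
    for ch in text:
        if ord(ch) > 127:
            # Fall back to the code point when the character has no name.
            name = unicodedata.name(ch, f"U+{ord(ch):04X}")
            counts[name] += 1
    return counts
```

Keying by Unicode name makes invisible characters such as NO-BREAK SPACE show up explicitly in the report.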

Several of them are valid ones: the ones used in names
(like Günther, Alcôve, ...).

Those should remain as-is.

Several documents converted from DocBook/HTML contain UTF-8 NO-BREAK SPACE
and other invisible chars, like the byte order mark (BOM) pointed out
by Randy.

Those should be replaced (or removed for non-printable ones).
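
One possible way to do that, sketched in Python: it assumes that "invisible" means the Unicode general categories Zs (space separators, which includes NO-BREAK SPACE) and Cf (format characters, which includes the BOM). That is an assumption of this sketch, not a rule stated in this thread, and `fix_invisible` is a hypothetical name.

```python
import unicodedata


def fix_invisible(text: str) -> str:
    """Replace space-like characters with ' ' and drop format characters."""
    out = []
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Zs":
            out.append(" ")  # any space separator, e.g. NO-BREAK SPACE
        elif category == "Cf":
            continue         # format characters such as the BOM (U+FEFF)
        else:
            out.append(ch)   # everything else, including accented letters
    return "".join(out)
```

Names such as Günther pass through unchanged, since accented letters fall into neither category.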

-

Now, there are other cases where I'm not sure if there's a
consensus:

1. UTF-8 is used where there's a similar ASCII character (but with
   a different glyph), like:

	- UTF-8 commas;
	- UTF-8 hyphen chars, including the long ones:
	  FIGURE DASH, EN DASH, EM DASH

   IMO, those should also be converted.

2. Some UTF-8 symbols, like:

	- ® 
	- ™
	- ² - used mainly for I²C
	- …
	- ⬍ ↑ ↓   
	- µs - used for microseconds

   I would keep those.

3. There are a couple of places that use UTF-8 graphic characters, like:

        /sys/devices/system/edac/
        ├── mc
        │   ├── mc0
        │   │   ├── ce_count
        │   │   ├── ce_noinfo_count

   This is the normal output of the "tree" command on machines with UTF-8.
   I would keep it. 

   Yet, iconv converts it into:

        /sys/devices/system/edac/
        +-- mc
        |   +-- mc0
        |   |   +-- ce_count
        |   |   +-- ce_noinfo_count

   which would also be fine. So, replacing those would be a no-brainer,
   but newer documents will probably be written using such symbols.

   So, I would preserve the UTF-8 graphics characters.
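
The three cases above could be expressed as a selective translation table rather than a blanket iconv run. The Python sketch below is only an illustration of the proposal: the table is an assumed, non-exhaustive sample, "UTF-8 commas" is read here as FULLWIDTH COMMA, and `TO_ASCII` and `apply_policy` are hypothetical names.

```python
# Case 1: look-alikes that have an ASCII equivalent. The list is an assumed,
# non-exhaustive sample; "UTF-8 commas" is read here as FULLWIDTH COMMA.
TO_ASCII = str.maketrans({
    "\u2012": "-",   # FIGURE DASH
    "\u2013": "-",   # EN DASH
    "\u2014": "--",  # EM DASH (two hyphens is an assumed choice)
    "\uff0c": ",",   # FULLWIDTH COMMA
})


def apply_policy(text: str) -> str:
    """Convert case 1 look-alikes; leave cases 2 and 3 untouched."""
    # Symbols (case 2) and box-drawing characters (case 3) are not in the
    # table, so str.translate() passes them through unchanged.
    return text.translate(TO_ASCII)
```

Unlike `iconv -t ascii//TRANSLIT`, this leaves I²C, µs and the "tree" output as they are, and only touches the characters listed in the table.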

I'm preparing a patchset to address the UTF-8 issues on top of
today's next, but before posting, it seems reasonable to discuss
what to do with the above cases. Comments?

Thanks,
Mauro