Thread: 16 messages, 3 authors, 2025-01-28

Re: man/man7/pathname.7: Correct handling of pathnames

From: Alejandro Colomar <alx@kernel.org>
Date: 2025-01-27 17:37:22

[CC += наб]

Hi Jason,

On Mon, Jan 27, 2025 at 12:14:43PM -0500, Jason Yundt wrote:
> On Mon, Jan 27, 2025 at 04:53:10PM +0100, Alejandro Colomar wrote:
> > Right.  But then, when do you need to do encoding?
>
> Personally, my preference is that programs use the locale’s codeset
> because I can override the locale codeset in the rare event that UTF-8
> isn’t the correct option.  In my previous example, I was able to set the
> LANG environment variable to ja_JP.SJIS so that I could run that old
> software in an environment where pathnames were encoded in Shift-JIS.
> If everything just always assumed a particular character encoding for
> pathnames, then I wouldn’t have been able to do that.

But if the program handles arbitrary strings, just like the kernel does,
that would work too.

> > > > -  Accept anything, but reject control characters.
> > > > -  Accept anything, just like the kernel.
> > >
> > > These last two also aren’t quite complete recommendations.  If a GUI
> > > program wants to display a pathname on the screen, then what character
> > > encoding should it use when decoding the bytes?
> >
> > Just print them as they got in.  No decoding.  Send the raw bytes to
> > write(2) or printf(3) or whatever.
>
> I don’t think that printing is a good way for GUI applications to
> display text.  I don’t normally run GUI applications in a terminal, so
> I’m not normally able to see a GUI application’s stdout or stderr.  Most
> of the GUI applications that I use display pathnames as part of a larger
> window.  In order to do that, the GUI application needs to know which
> characters the bytes in the pathname represent so that the GUI
> application can draw those characters on the screen.

In a GUI, I would do exactly what command-line programs do: pass the
raw string to whatever API displays it.  If the string makes sense in
the current locale, it will be shown nicely.  If it doesn't, it will
display weird characters, but that's not a terrible issue.  Just run
the program again with the appropriate locale.
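
As a rough sketch of "pass the raw string through", assuming a POSIX
system: the helper names write_all() and print_pathname_raw() below are
made up for illustration and are not part of any existing API.  The
pathname is treated as an opaque NUL-terminated byte string and handed to
write(2) without any decoding.

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Write all of buf to fd, retrying on short writes and EINTR.
 * Returns 0 on success, -1 on error (errno as set by write(2)). */
static int
write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t  n = write(fd, buf, len);

		if (n == -1) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		buf += n;
		len -= (size_t) n;
	}
	return 0;
}

/* Hypothetical helper: emit a pathname exactly as stored, followed by
 * a newline.  No decoding and no check against the locale's codeset;
 * interpreting the bytes is left to whatever reads the output. */
int
print_pathname_raw(int fd, const char *path)
{
	if (write_all(fd, path, strlen(path)) == -1)
		return -1;
	return write_all(fd, "\n", 1);
}
```

A caller would use something like print_pathname_raw(STDOUT_FILENO, path).
Whether the result looks right then depends only on the terminal (or
other consumer) and the locale it runs in, not on this code.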

For example, in the git repository of the Linux man-pages project, there
are commits authored by наб [off-list ref].
Whenever I run git-log(1) on one of my systems that uses the C locale, I
see weird characters.  I just need to re-run it with the C.UTF-8 locale.

But git(1) handles the bytes correctly, even if they don't make sense to
the system.  If git(1) failed whenever a string didn't make sense in the
current locale, the repo would be corrupted sooner rather than later.
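
To illustrate the difference between checking and failing, here is a
minimal sketch using ISO C mbsrtowcs(3).  The function name
decodes_in_locale() is hypothetical, and this is not how git(1) is
implemented; it only shows that a program can ask whether a string
decodes in the current locale and still treat the answer as advisory.

```c
#include <stdbool.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: report whether s is a valid multibyte string in
 * the current LC_CTYPE locale.  The result is informational only; the
 * caller keeps using the original bytes either way. */
bool
decodes_in_locale(const char *s)
{
	mbstate_t   st;
	const char  *p = s;

	memset(&st, 0, sizeof(st));

	/* With a null destination, mbsrtowcs(3) only counts the wide
	 * characters and returns (size_t) -1 on an invalid sequence. */
	return mbsrtowcs(NULL, &p, 0, &st) != (size_t) -1;
}
```

A program could use the result to decide how to render a name (for
example, with a marker next to it) while still storing, comparing, and
passing the unmodified bytes to open(2) and friends.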


Cheers,
Alex

-- 
<https://www.alejandro-colomar.es/>
