[darcs-users] Latin vs. Unicode

Stephen J. Turnbull stephen at xemacs.org
Mon Nov 17 04:34:09 UTC 2014


Ben Franksen writes:

 > Over the last years, unicode has established itself world-wide and firmly 
 > and is well supported by all the major operating systems. This is why I vote 
 > for dropping support for older 8-bit encodings that are not unicode 
 > compatible, thereby allowing e.g. Chinese users to use Darcs with their 
 > native languages.

Does "just dropping 8-bit support" actually enable that, or does it
only work in a .UTF-8 locale?  Or does it even work at all?  I have
trouble imagining how a random 8-bit encoding would get passed in
verbatim to a widechar Unicode string, which can then be cast back to
an 8-bit encoding that actually comes out the way it went in.  8-bit
encodings (including Latin-1) must be recoded to Unicode, or they will
almost certainly violate the UTF-8 format (e.g., the sequence ASCII
character, Latin-1 non-ASCII character, ASCII character can never be
valid UTF-8, but it is extremely common in Latin-1 text).
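
To make that concrete, here is a small Python sketch (the example
string is my own, not from Darcs): a non-ASCII Latin-1 byte between
ASCII bytes cannot form a valid UTF-8 sequence, because it would be a
UTF-8 lead byte demanding continuation bytes that ASCII cannot supply.

```python
# "cafés" encoded in Latin-1: 0xE9 is "é" in Latin-1, but in UTF-8
# it is a lead byte that must be followed by continuation bytes
# (0x80-0xBF), which the ASCII "s" (0x73) is not.
data = b"caf\xe9s"

print(data.decode("latin-1"))   # decodes fine as Latin-1

try:
    data.decode("utf-8")
except UnicodeDecodeError as e:
    print("invalid UTF-8:", e.reason)
```

So Latin-1 text pasted into a stream that is assumed to be UTF-8 is
not merely misread, it is malformed.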

Nor do I think you can count on command lines having a .UTF-8 locale.
Shift JIS and, to some extent, EUC-JP remain popular in Japan, and at
least my Chinese students frequently use Big5 and the GB family of
encodings.  All of these have repertoires that are Unicode subsets,
but the encodings are different.  Users expect to be able to "cat"
them to the terminal and read them, and for that use case they will
have a locale that specifies a default charset other than UTF-8.  Most
terminals are not able to switch encodings on the fly, so this can be
extremely inconvenient.

I'm not saying it's not worth doing, but be prepared for quite a bit
more work than "just dropping 8-bit support."
