[darcs-users] Default binary masks

Sean E. Russell ser at germane-software.com
Tue Nov 25 03:43:52 UTC 2003

On Monday 24 November 2003 17:49, Trevor Talbot wrote:
> UTF-16 is the most balanced of the Unicode formats to deal with.

Why "most balanced"?  It seems to combine the worst of all worlds:

* Incompatibility with 7-bit ASCII
* Two byte orderings -- big-endian (UTF-16BE) and little-endian (UTF-16LE); 
haven't we struggled enough with byte ordering problems?
* Overlapping start characters -- given a byte in the middle of a file, you 
have to know two pieces of information to start scanning characters: the 
offset of the byte (so you know whether it is the high or low byte of a code 
unit) and the byte ordering (which is marked all the way at the start of the 
byte stream).
* You get none of the advantages of a fixed-width encoding, yet everybody 
seems to think it is a fixed-width encoding.
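
A minimal Python sketch of the points above, using only the standard codecs. 
The sample strings and variable names are illustrative assumptions, not 
anything from the thread:

```python
# 1. ASCII text is not byte-compatible with UTF-16: each character gains a zero byte.
ascii_text = "A"
utf8_bytes = ascii_text.encode("utf-8")          # b'A'
utf16le_bytes = ascii_text.encode("utf-16-le")   # b'A\x00'
utf16be_bytes = ascii_text.encode("utf-16-be")   # b'\x00A'

# 2. The same bytes read with the wrong byte order decode to different text.
wrong_order = utf16le_bytes.decode("utf-16-be")  # '\u4100', not 'A'

# 3. Starting at an odd byte offset misaligns every code unit that follows.
data = "AB".encode("utf-16-le")                  # b'A\x00B\x00'
misaligned = data[1:3].decode("utf-16-le")       # '\u4200', not 'A' or 'B'

# 4. UTF-16 is not fixed-width: characters outside the Basic Multilingual
#    Plane take two 16-bit code units (a surrogate pair).
clef = "\U0001D11E"                              # MUSICAL SYMBOL G CLEF
clef_units = len(clef.encode("utf-16-le")) // 2  # 2 code units for 1 character

print(utf8_bytes, utf16le_bytes, utf16be_bytes)
print(repr(wrong_order), repr(misaligned), clef_units)
```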

Most of this makes UTF-16 inferior to UTF-8 in terms of interoperability, 
which is why UTF-8 is more common than UTF-16 in transport protocols and data 
storage, and why XML defaults to UTF-8 encoding.
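
For contrast, here is a rough sketch of why UTF-8 does not share the 
mid-stream scanning problem: continuation bytes always have the bit pattern 
10xxxxxx, so a reader dropped at any offset can find the next character start 
without knowing the byte order or its position. The function name 
`next_char_start` is a hypothetical helper for illustration, not part of any 
library:

```python
def next_char_start(data: bytes, offset: int) -> int:
    """Return the first index >= offset that is not a UTF-8 continuation byte."""
    # Continuation bytes match 0b10xxxxxx; skip them until a lead byte
    # (or a plain ASCII byte) is found, or the data ends.
    while offset < len(data) and (data[offset] & 0xC0) == 0x80:
        offset += 1
    return offset

sample = "\u00e9\u20acb".encode("utf-8")   # b'\xc3\xa9\xe2\x82\xacb'

# Offset 1 lands inside the 2-byte character; the next start is at index 2.
start_a = next_char_start(sample, 1)
# Offset 3 lands inside the 3-byte character; the next start is at index 5.
start_b = next_char_start(sample, 3)

print(start_a, sample[start_a:].decode("utf-8"))
print(start_b, sample[start_b:].decode("utf-8"))
```

This only shows resynchronization on well-formed input; it makes no attempt 
to validate the bytes it skips.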

> It's also NT's native format.  Most non-trivial Unicode APIs on other
> platforms seem to use UTF-16 as well.

I admit to a Western bias, for which UTF-8 is much better suited (being 
compatible with 7-bit ASCII).  In UTF-16's favor, Java uses it, as does 
ECMAScript (although, if that's not a good reason to *not* choose UTF-16, I 
don't know what is), and UTF-16 is recommended by the Unicode Consortium.  
Despite this, UTF-8 is more widely used than UTF-16; Japanese users do prefer 
UTF-16 to UTF-8, but they tend to prefer EUC and Shift-JIS even more, so no 
gains there.

I think this is starting to stray off-topic, though.

-- 
### SER   
### Deutsch|Esperanto|Francaise|Linux|XML|Java|Ruby|Aikido|Dirigibles
### http://www.germane-software.com/~ser  jabber.com:ser  ICQ:83578737 
### GPG: http://www.germane-software.com/~ser/Security/ser_public.gpg

