[darcs-users] UTF-16 (was: Default binary masks)

Trevor Talbot quension at mac.com
Wed Nov 26 00:59:51 UTC 2003

On Tuesday, Nov 25, 2003, at 08:39 US/Pacific, Sean E Russell wrote:

> I started to take this off-list; I'm not sure how much relevance this 
> has to darcs, although if darcs is going to start getting into 
> encoding support, then this is probably an important discussion.

Since I received these via the list, I'll continue to reply here.  
Perhaps we should just say, "if David says be quiet, we will" :)

> On Tuesday 25 November 2003 02:56, Trevor Talbot wrote:

>>> * Incompatibility with 7-bit ASCII
>> This is a deliberate break.  As a result, it does not have to go 
>> through great encoding pains -- this equates to faster and simpler 
>> processing.
> Can you explain this for me?  What encoding pain does it avoid by not 
> being 7-bit ASCII compatible?

The 1- to 6-byte-per-codepoint variable-length encoding.  By being 
"less" variable, UTF-16 becomes more efficient to process.

> UTF-16 is a variable-length encoding technique.  Or, put another way, 
> the character length of a UTF-16 string is not just the (number of 
> bytes/2)-2.  Just like with UTF-8, to count the character length, you 
> have to scan the entire UTF-16 string.


> UTF-8, by comparison, is stateless and error resistant[1].  Inserting 
> or deleting characters from a UTF-8 stream will cause only localized 
> corruption.  Inserting or deleting characters from a UTF-16 stream can 
> potentially corrupt the entire stream, rendering it uninterpretable.

UTF-16 is just as stateless and error resistant as UTF-8 within a 
stream.  The key with both of them is that they are stateless and error 
resistant on the codepoint level.  For example, UTF-8 is not stateless 
and error resistant in the face of a bit stream.

>> You generally process a file in 16bit units, so byte offsets never 
>> enter the picture; you just need to know whether to swap the bytes
> Not in my neck of the woods.  Ruby and C both define a "character" as 
> a single byte.  Ruby, C, and Java byte stream accessors all return 
> single bytes (although Java returns the bytes as ints, which are 32 
> bit, the ints only contain 8 significant bits).  So, I'm curious as to 
> where you are "generally" processing files in 16 bit units.

Note that while C does define a "character" as a byte, it does not 
define a "byte" as an "octet".  Anywhere you process UTF-16, you are 
dealing with 16-bit code units.  If your underlying file access is on 
an octet basis (as it would be in most of the systems in this 
discussion), then you read, write and move 2 octets at a time at that 
level.  But all Unicode processing occurs on the code units, not the 
octets.  (I'm switching back to "byte" for the rest of this, since 
everyone knows we mean "octet".)

C also has a notion of a "wide character", and appropriate functions 
to deal with it, such as fgetws().  Unfortunately gcc and glibc's 
default wchar_t width is 32 bits, which makes it inappropriate for 
UTF-16 processing.  Windows compilers use a 16-bit width.

> Just as an aside, it is stated in several places that UTF-16 is 
> particularly bad for Unix systems, because of the occurrence of 
> normally "illegal" bytes (such as 0x00) in "text" files.

This goes back to having to interoperate with byte-oriented tools.  If 
that's necessary, UTF-16 is a bad choice.  It's worth noting that 
Apple has managed the "break" in many places within OS X, though it's 
true that those places don't really touch the BSD layer.  It's also 
true that both OS X and Windows NT run on Unicode-aware filesystems by 
default.  Unicode in an OS is one of those things that needs to be 
supported by most APIs to be really useful.

On Tuesday, Nov 25, 2003, at 12:43 US/Pacific, Sean E Russell wrote:

> On Tuesday 25 November 2003 12:21, Kevin Smith wrote:
>> As much as I dislike UTF-16 files, it appears from reading the 
>> standard[1] that you can, in fact, scan forward or back just a little 
>> bit to find the start of a character. As with UTF-8, they designed it 
>> such that the first part of a multi-part sequence falls within a 
>> unique range. And, subsequent parts of a multi-part sequence are also 
>> identifiable as such.
> Only on a two-byte boundary.  The two bytes that make up the unique 
> value are such that the first byte is a legal second byte of another 
> UTF-16 character, and the second byte is a legal first byte of another 
> UTF-16 character.

A byte-oriented approach is an incorrect way to use UTF-16.  With 
UTF-16, you always deal with 16-bit code units, never bytes.

> And you (of course) also have to know the byte sex, which information 
> is located at the start of the stream.

This is a read-once piece of state information, and it doesn't 
necessarily have to be at the beginning of the stream.  UTF-16 is 
assumed to be in big-endian format unless otherwise marked (either out 
of band, or with a byte order mark: the stream's initial U+FEFF reads 
as 0xFFFE when the bytes are swapped).

> And you have to be sure that the stream hasn't had any bytes inserted 
> or deleted, which would skew your byte boundary.

This is not a requirement.  Underlying storage of codepoints is beyond 
the scope of Unicode processing.  If a byte (for UTF-16) or a bit (for 
UTF-8) is inserted or deleted, you have corrupt data, and there is 
nothing you can do at the Unicode processing level to correct it.  This 
is the domain of storage and transfer systems.
