Offlist: Re: [darcs-users] Default binary masks

Sean E Russell ser at germane-software.com
Tue Nov 25 16:39:05 UTC 2003


I started to take this off-list; I'm not sure how much relevance this has to 
darcs, although if darcs is going to start getting into encoding support, 
then this is probably an important discussion.

On Tuesday 25 November 2003 02:56, Trevor Talbot wrote:
[quotes in this response are out of order]
> > * Incompatibility with 7-bit ASCII
>
> This is a deliberate break.  As a result, it does not have to go
> through great encoding pains -- this equates to faster and simpler
> processing.

Can you explain this for me?  What encoding pain does UTF-16 avoid by not
being 7-bit ASCII compatible?

UTF-16 is a variable-length encoding.  Put another way, the character
length of a UTF-16 string is not simply (number of bytes minus the 2-byte
BOM) / 2.  Just as with UTF-8, to count the characters you have to scan
the entire UTF-16 string.
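
To make that concrete, here is a rough sketch in C (untested; it assumes
a big-endian buffer with no BOM and does no error checking, and the
function name is just mine) of what counting characters in UTF-16
actually involves:

	#include <stddef.h>

	/* Count characters in a big-endian UTF-16 buffer.  A code unit in
	 * 0xD800..0xDBFF is a high surrogate: it and the following code
	 * unit together encode a single character. */
	size_t utf16be_char_count(const unsigned char *buf, size_t nbytes)
	{
	    size_t i = 0, count = 0;
	    while (i + 1 < nbytes) {
	        unsigned unit = (unsigned)((buf[i] << 8) | buf[i + 1]);
	        i += (unit >= 0xD800 && unit <= 0xDBFF) ? 4 : 2;
	        count++;
	    }
	    return count;
	}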

In particular, a UTF-16 character can be either 2 bytes wide or 4 bytes
wide.  Furthermore, the bytes that make up a 4-byte sequence can also
occur as parts of 2-byte sequences, so an arbitrary subsequence of UTF-16
bytes cannot be guaranteed to decode correctly without knowing where the
code-unit boundaries fall.

For example, read as big-endian pairs, the bytes:

	... D8 00 DC 00 ...

are a single UTF-16 character (a surrogate pair), but the same bytes
embedded at a shifted alignment:

	... 42 D8 00 DC 00 7B ...

decode as 3 UTF-16 characters.  What this means is that...

> within the 16bit unit.  As a result of the simple surrogate encoding,
> you need to scan a maximum of 1 codepoint in either direction to
> retrieve the full character you landed on, or a maximum of 3 codepoints
> to retrieve the next or previous characters.  Compare to UTF-8 scanning.

... this is not true.  Since (a) the number of bytes per character is
variable, (b) the byte order is variable, and (c) byte subsequences are
ambiguous, UTF-16 is *not* random access at the byte level, and you can't
recover by scanning a little in either direction.  You must scan the
entire UTF-16 stream up to a character, starting where the byte order and
alignment are established, before you can decode that character.
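
Here's a toy decoder to make the alignment problem concrete (untested; it
hard-codes the byte values from my example above and assumes big-endian
input):

	#include <stdio.h>
	#include <stddef.h>

	/* Print the code points in a big-endian UTF-16 byte buffer,
	 * pairing bytes from the start of the buffer. */
	static void dump_utf16be(const unsigned char *buf, size_t nbytes)
	{
	    size_t i = 0;
	    while (i + 1 < nbytes) {
	        unsigned unit = (unsigned)((buf[i] << 8) | buf[i + 1]);
	        if (unit >= 0xD800 && unit <= 0xDBFF && i + 3 < nbytes) {
	            unsigned low = (unsigned)((buf[i + 2] << 8) | buf[i + 3]);
	            unsigned cp  = 0x10000 + ((unit - 0xD800) << 10)
	                                   + (low - 0xDC00);
	            printf("U+%05X ", cp);
	            i += 4;
	        } else {
	            printf("U+%04X ", unit);
	            i += 2;
	        }
	    }
	    printf("\n");
	}

	int main(void)
	{
	    unsigned char one[]   = { 0xD8, 0x00, 0xDC, 0x00 };
	    unsigned char three[] = { 0x42, 0xD8, 0x00, 0xDC, 0x00, 0x7B };

	    dump_utf16be(one,   sizeof one);     /* U+10000              */
	    dump_utf16be(three, sizeof three);   /* U+42D8 U+00DC U+007B */
	    return 0;
	}

The same four bytes D8 00 DC 00 show up in both buffers, but they only
decode as one character when the code-unit boundaries happen to line up.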

UTF-8, by comparison, is stateless and error resistant[1].  Inserting or
deleting bytes in a UTF-8 stream causes only localized corruption.
Inserting or deleting bytes in a UTF-16 stream can potentially corrupt the
entire stream, rendering it uninterpretable.  A trivial example is
deleting the first two bytes of the stream (the byte-order mark).
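
For contrast, resynchronizing inside a damaged UTF-8 stream takes only a
few lines, because a continuation byte (10xxxxxx) can never be mistaken
for a lead byte.  A rough, untested sketch:

	#include <stddef.h>

	/* Given a byte offset that may land in the middle of a UTF-8
	 * sequence, step forward to the start of the next character.
	 * Continuation bytes always match 10xxxxxx, which is what makes
	 * UTF-8 self-synchronizing. */
	size_t utf8_resync(const unsigned char *buf, size_t nbytes, size_t pos)
	{
	    while (pos < nbytes && (buf[pos] & 0xC0) == 0x80)
	        pos++;
	    return pos;
	}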

> You generally process a file in 16bit units, so byte offsets never
> enter the picture; you just need to know whether to swap the bytes

Not in my neck of the woods.  Ruby and C both define a "character" as a
single byte.  Ruby, C, and Java byte stream accessors all return single
bytes (Java returns each byte as an int, but only the low 8 bits are
significant).  So I'm curious as to where you are "generally" processing
files in 16-bit units.
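
For instance, the usual C idiom reads one byte at a time; fgetc() returns
an int only to leave room for EOF, not because the data is wider than 8
bits:

	#include <stdio.h>

	/* Count the bytes in a file the way most Unix tools see them:
	 * one 8-bit unit at a time. */
	long count_bytes(FILE *fp)
	{
	    int c;
	    long n = 0;
	    while ((c = fgetc(fp)) != EOF)
	        n++;
	    return n;
	}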

Just as an aside, it is stated in several places that UTF-16 is
particularly bad for Unix systems, because of the occurrence of normally
"illegal" bytes (such as 0x00) in "text" files.
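
For example (untested), even strlen() treats the 0x00 bytes that UTF-16
adds to plain ASCII text as end-of-string:

	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
	    /* "AB" encoded as UTF-16BE: every ASCII character drags a
	     * 0x00 byte along, and C's string functions treat 0x00 as
	     * the terminator. */
	    const char utf16_ab[] = { 0x00, 'A', 0x00, 'B', 0x00, 0x00 };

	    printf("%zu\n", strlen(utf16_ab));   /* prints 0, not 2 */
	    return 0;
	}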

--- SER
[1] http://www.cl.cam.ac.uk/~mgk25/unicode.html




