Offlist: Re: [darcs-users] Default binary masks
Sean E Russell
ser at germane-software.com
Tue Nov 25 16:39:05 UTC 2003
I started to take this off-list; I'm not sure how much relevance this has to
darcs, although if darcs is going to start getting into encoding support,
then this is probably an important discussion.
On Tuesday 25 November 2003 02:56, Trevor Talbot wrote:
[quotes in this response are out of order]
> > * Incompatibility with 7-bit ASCII
> This is a deliberate break. As a result, it does not have to go
> through great encoding pains -- this equates to faster and simpler
Can you explain this for me? What encoding pain does UTF-16 avoid by not being
7-bit ASCII compatible?
UTF-16 is a variable-length encoding. Or, put another way, the character
length of a UTF-16 string is not simply the number of bytes divided by two
(less one code unit for any BOM). Just as with UTF-8, to count the characters
you have to scan the entire UTF-16 string.
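The length mismatch is easy to demonstrate. A small sketch in Python 3 (where
`len()` on a string counts characters) using a character outside the Basic
Multilingual Plane:

```python
# U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP, so UTF-16
# encodes it as a surrogate pair (two 16-bit code units).
s = "a\U0001D11E"

encoded = s.encode("utf-16-le")  # explicit byte order, so no BOM is written
print(len(s))                    # 2 characters
print(len(encoded) // 2)         # 3 code units -- not the same number
```

Counting bytes (or 16-bit units) therefore tells you nothing definite about
the number of characters.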
In particular, UTF-16 characters can be either 2 bytes wide or 4 bytes wide.
Furthermore, without knowing the alignment and byte order, the bytes of a
4-byte sequence can also be read as parts of 2-byte sequences, which makes it
impossible to guarantee that an arbitrary subsequence of UTF-16 bytes will be
decoded correctly.
For example, read in big-endian order, the bytes:
... D8 00 DC 00 ...
are a single UTF-16 character (a surrogate pair), but the bytes:
... 42 D8 00 DC 00 7B ...
are 3 UTF-16 characters. What this means is that...
> within the 16bit unit. As a result of the simple surrogate encoding,
> you need to scan a maximum of 1 codepoint in either direction to
> retrieve the full character you landed on, or a maximum of 3 codepoints
> to retrieve the next or previous characters. Compare to UTF-8 scanning.
... this is not true. Since (a) the number of bytes per character is
variable, (b) the byte order is variable, and (c) byte subsequences are
ambiguous, UTF-16 is *not* random access at the byte level, and you can't
correct it by scanning in either direction. You must scan the entire UTF-16
stream up to a character to be able to decode that character.
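The byte example above can be checked directly. This Python sketch decodes
the bytes in big-endian order, matching the sequence as written:

```python
# The same four bytes decode differently depending on where the
# 16-bit unit boundaries fall.
core = bytes([0xD8, 0x00, 0xDC, 0x00])
framed = bytes([0x42]) + core + bytes([0x7B])

# Aligned on an even boundary: D800 DC00 is a surrogate pair, one character.
print(len(core.decode("utf-16-be")))    # 1
# Shifted by one byte: 42D8 00DC 007B are three ordinary BMP characters.
print(len(framed.decode("utf-16-be")))  # 3
```

The same four bytes are one character or parts of three, depending entirely
on context outside those bytes.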
UTF-8, by comparison, is self-synchronizing and error resistant. Inserting or
deleting bytes in a UTF-8 stream causes only localized corruption. Inserting
or deleting bytes in a UTF-16 stream can potentially corrupt the entire
stream, rendering it uninterpretable. A trivial example is deleting the first
byte of the stream, which misaligns every code unit that follows.
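A Python sketch of the contrast (the exact garbage produced depends on the
text, but the localized-versus-global corruption is the point):

```python
text = "héllo wörld"

u8 = text.encode("utf-8")
# Dropping the first byte of UTF-8 loses only the first character;
# the decoder resynchronizes at the next character boundary.
print(u8[1:].decode("utf-8", errors="replace"))

u16 = text.encode("utf-16-le")
# Dropping the first byte of UTF-16 misaligns every following code
# unit, so the entire remaining stream decodes as garbage.
print(u16[1:].decode("utf-16-le", errors="replace"))
```

The truncated UTF-8 stream decodes to "éllo wörld"; the truncated UTF-16
stream decodes to a string containing none of the original words.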
> You generally process a file in 16bit units, so byte offsets never
> enter the picture; you just need to know whether to swap the bytes
Not in my neck of the woods. Ruby and C both define a "character" as a single
byte. Ruby, C, and Java byte stream accessors all return single bytes
(although Java's InputStream.read() returns an int, which is 32 bits wide,
only the low 8 bits are significant). So I'm curious as to where you are
"generally" processing files in 16-bit units.
Just as an aside, it is stated in several places that UTF-16 is particularly
bad for Unix systems, because of the occurrence of normally "illegal" bytes
(such as 0x00) in "text" files.
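That aside is easy to verify: even pure-ASCII text encoded as UTF-16 is full
of NUL bytes, which C-style string routines (strlen, strcpy, and most Unix
text tools) treat as terminators. A Python sketch:

```python
data = "Hello".encode("utf-16-le")
print(data)           # b'H\x00e\x00l\x00l\x00o\x00'
# Every ASCII character contributes a 0x00 high byte, so a
# NUL-terminated string routine would stop after the first byte.
print(data.index(0))  # 1
```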