[darcs-users] UTF-16 (was: Default binary masks)

Sean E. Russell ser at germane-software.com
Wed Nov 26 01:47:12 UTC 2003


On Tuesday 25 November 2003 19:59, Trevor Talbot wrote:
> The 1 - 6 codepoint variable encoding.  By being "less" variable,
> UTF-16 becomes more efficient to process.

Well, that's true.

> UTF-16 is just as stateless and error resistant as UTF-8 within a
> resistant on the codepoint level.  For example, UTF-8 is not stateless
> and error resistant in the face of a bit stream.

Error resistant at the codepoint level, yes.  If you have a stream that is
reliable at the level of 2 bytes, then UTF-16 is error resistant.  However,
UTF-16 is still not stateless.  I'll state the trivial case once again: you
need the byte order from the start of the stream.  Since you need information
from the start of the stream, UTF-16 is, by definition, not stateless.
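To make that concrete, here's a quick Python sketch (illustrative only,
nothing darcs-specific): the same octets decode to completely different text
depending on which byte order you assume, so a decoder needs that piece of
state before it can interpret anything.

    data = "hi".encode("utf-16-le")        # b'h\x00i\x00'
    print(data.decode("utf-16-le"))        # 'hi'
    print(data.decode("utf-16-be"))        # '\u6800\u6900' -- CJK garbage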

> dealing with 16 bit codepoints.  If your underlying file access is on
> an octet basis (as it would be in most of the systems in this
> discussion), then you read, write and move 2 octets at a time on that

This assumes that you're starting at the front of the stream, which is why 
UTF-16 isn't stateless.  If you have to "start counting" from the start of 
the stream, it isn't stateless.  It is only stateless if you know you are 
dropping into the stream on a double-byte boundary, which is much less 
probable than dropping in on an arbitrary byte boundary.
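For what it's worth, this is exactly the property UTF-8's continuation bytes
buy you.  A sketch (Python, illustrative only): you can resynchronize from an
arbitrary byte offset just by skipping bytes of the form 10xxxxxx, something
UTF-16 gives you no local way to do.

    def utf8_resync(buf, pos):
        # skip continuation bytes (10xxxxxx) to the next codepoint boundary
        while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
            pos += 1
        return pos

    text = "héllo wörld".encode("utf-8")
    print(text[utf8_resync(text, 2):].decode("utf-8"))   # 'llo wörld'

Land on an odd octet in a UTF-16 stream instead, and you decode
plausible-looking but wrong code units with nothing local to tell you so.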

> This goes back to having to interoperate with byte-oriented tools.  If
> it's necessary, UTF-16 is a bad choice.  It's worth a note that Apple

Yeah, I didn't say it was impossible; just that it is considered a "bad 
choice".

> default.  Unicode in an OS is one of those things that needs to be
> supported by most APIs to be really useful.

You and I completely agree here.  Without OS support -- or even language 
support -- dealing with Unicode is a pain.

> necessarily have to be at the beginning of the stream.  UTF-16 is
> assumed to be in Big Endian format unless otherwise marked (either out
> of band, or with the first codepoint of the stream being 0xFFFE).

Which is fine... except that -- as I understand it -- Windows encodes all 
UTF-16 files in little-endian format.  That's enough, IME, to render the 
convention useless.
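The convention itself is simple enough -- here's a sketch (Python,
illustrative only) of it: sniff the first two octets for the byte-order mark,
and otherwise fall back to the nominal big-endian default that
Windows-produced files routinely ignore.

    def sniff_utf16(buf):
        if buf[:2] == b'\xfe\xff':
            return "utf-16-be"      # BOM read in order: big-endian
        if buf[:2] == b'\xff\xfe':
            return "utf-16-le"      # byte-swapped BOM: little-endian
        return "utf-16-be"          # nominal default, often wrong in practice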

> This is not a requirement.  Underlying storage of codepoints is beyond
> the scope of Unicode processing.  If a byte (for UTF-16) or a bit (for
> UTF-8) is inserted or deleted, you have corrupt data, and there is
> nothing you can do at the Unicode processing level to correct it.  This
> is the domain of storage and transfer systems.

My point was that inserting a spurious byte (or deleting one) in the sequence 
of a UTF-8 stream will only have local effects -- it will not affect the 
interpretation of characters more than 6 bytes away in either direction.  If a 
byte is added or dropped in a UTF-16 stream, the UTF-16 decoder will get a 
wrong boundary count, and the entire stream (after the error) will be 
corrupted.

However, your earlier point is valid: the same could be said about spurious 
bits in a UTF-8 stream.  The significant issue is that UTF-16 is *less* 
resistant than UTF-8, because the domain of potential corrupting events 
is larger.  Anything that will screw up a UTF-8 stream will screw up a UTF-16 
stream... and there are things that will screw up a UTF-16 stream that will 
not screw up a UTF-8 stream.
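A sketch of that asymmetry (Python, illustrative only): drop one octet from
each kind of stream and compare the damage.

    u16 = "hello world".encode("utf-16-le")
    bad16 = u16[:4] + u16[5:]                          # drop one octet mid-stream
    print(bad16.decode("utf-16-le", errors="replace")) # 'he', then garbage to the end

    u8 = "héllo wörld".encode("utf-8")
    bad8 = u8[:1] + u8[2:]                             # drop one octet of the 'é'
    print(bad8.decode("utf-8", errors="replace"))      # only the 'é' is mangled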

- -- 
### SER   
### Deutsch|Esperanto|Francaise|Linux|XML|Java|Ruby|Aikido|Dirigibles
### http://www.germane-software.com/~ser  jabber.com:ser  ICQ:83578737 
### GPG: http://www.germane-software.com/~ser/Security/ser_public.gpg