[darcs-users] UTF-16 (was: Default binary masks)
Sean E. Russell
ser at germane-software.com
Wed Nov 26 01:47:12 UTC 2003
On Tuesday 25 November 2003 19:59, Trevor Talbot wrote:
> The 1 - 6 codepoint variable encoding. By being "less" variable,
> UTF-16 becomes more efficient to process.
Well, that's true.
> UTF-16 is just as stateless and error resistant as UTF-8 within a
> resistant on the codepoint level. For example, UTF-8 is not stateless
> and error resistant in the face of a bit stream.
Error resistant at the codepoint level, yes. If you have a stream reliable
at the level of 2 bytes, then UTF-16 is error resistant. However, UTF-16 is
still not stateless. I, once again, state the trivial case: you need the
byte order from the start of the stream. Since you need information from the
start of the stream, UTF-16 is, by definition, not stateless.
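To make the point concrete, here is a minimal Python sketch (mine, purely
illustrative -- the codec names are Python's): the same 16-bit units decode
to entirely different characters depending on a byte-order assumption that
is only established at the start of the stream.

    text = "hello"

    le = text.encode("utf-16-le")     # b'h\x00e\x00l\x00l\x00o\x00'
    be = text.encode("utf-16-be")     # b'\x00h\x00e\x00l\x00l\x00o'

    # Nothing in the middle of the stream says which assumption is right;
    # decoding the little-endian bytes as big-endian yields perfectly
    # "legal" but wrong codepoints.
    print(le.decode("utf-16-le"))     # hello
    print(le.decode("utf-16-be"))     # U+6800 U+6500 U+6C00 U+6C00 U+6F00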
> dealing with 16 bit codepoints. If your underlying file access is on
> an octet basis (as it would be in most of the systems in this
> discussion), then you read, write and move 2 octets at a time on that
This assumes that you're starting at the front of the stream, which is why
UTF-16 isn't stateless. If you have to "start counting" from the start of
the stream, it isn't stateless. It is only stateless if you know you are
dropping into the stream on a double-byte boundary, which is much less
probable than that you're dropping in on a byte boundary.
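Here is a small sketch (mine, just to illustrate) of why the boundary
matters: a UTF-8 reader dropping into the middle of a stream can
resynchronize by skipping continuation bytes, which are recognizable by
their bit pattern; a UTF-16 reader that comes in on an odd byte offset has
no such pattern to latch onto, and every subsequent pair of bytes is
grouped wrongly.

    def utf8_resync(buf, offset):
        """Advance to the next UTF-8 character boundary at or after offset."""
        while offset < len(buf) and (buf[offset] & 0xC0) == 0x80:
            offset += 1     # 0b10xxxxxx bytes never start a character
        return offset

    data = "héllo wörld".encode("utf-8")
    start = utf8_resync(data, 2)            # offset 2 lands inside 'é'
    print(data[start:].decode("utf-8"))     # 'llo wörld' -- only 'é' is lost

    # UTF-16 from an odd offset: the rest of the stream is re-paired, and
    # the resulting units are still "legal", so the decoder cannot tell.
    data16 = "héllo wörld".encode("utf-16-le")
    print(data16[3:].decode("utf-16-le", errors="replace"))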
> This goes back to having to interoperate with byte-oriented tools. If
> it's necessary, UTF-16 is a bad choice. It's worth a note that Apple
Yeah, I didn't say it was impossible; just that it is considered a "bad idea."
> default. Unicode in an OS is one of those things that needs to be
> supported by most APIs to be really useful.
You and I completely agree here. Without OS support -- or even language
support -- dealing with Unicode is a pain.
> necessarily have to be at the beginning of the stream. UTF-16 is
> assumed to be in Big Endian format unless otherwise marked (either out
> of band, or with the first codepoint of the stream being 0xFFFE).
Which is fine... except -- as I understand -- Windows encodes all UTF-16 files
in Little Endian format. That's enough, IME, to render the convention useless
in practice.
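For what it's worth, the sniffing itself is trivial -- here is a sketch
(mine, illustrative only) of the convention Trevor describes: peek at the
first two bytes, take 0xFE 0xFF as big-endian, 0xFF 0xFE as little-endian,
and fall back to the big-endian default otherwise.

    def sniff_utf16(first_two_bytes):
        if first_two_bytes == b"\xfe\xff":
            return "utf-16-be"
        if first_two_bytes == b"\xff\xfe":
            return "utf-16-le"
        return "utf-16-be"      # unmarked: the spec says assume big-endian

    data = "hello".encode("utf-16")     # Python prepends a BOM; on an x86
                                        # box this is the little-endian form
    codec = sniff_utf16(data[:2])
    print(codec, data[2:].decode(codec))    # utf-16-le hello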
> This is not a requirement. Underlying storage of codepoints is beyond
> the scope of Unicode processing. If a byte (for UTF-16) or a bit (for
> UTF-8) is inserted or deleted, you have corrupt data, and there is
> nothing you can do at the Unicode processing level to correct it. This
> is the domain of storage and transfer systems.
My point was that inserting a spurious byte (or deleting one) in the sequence
of a UTF-8 stream will only have local effects -- it will not affect the
interpretation of characters more than 6 bytes away in any direction. If a
byte is added or dropped in a UTF-16 stream, the UTF-16 decoder will get a
wrong boundary count, and the entire stream (after the error) will be
misinterpreted.
However, your earlier point is valid: the same could be said about spurious
bits in a UTF-8 stream. The significant issue is that UTF-16 is *less*
resistant than UTF-8, because the domain of potential corrupting events
is larger. Anything that will screw up a UTF-8 stream will screw up a UTF-16
stream... and there are things that will screw up a UTF-16 stream that will
not screw up a UTF-8 stream.
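A throwaway Python illustration (mine) of the difference: drop a single
byte out of the middle of each encoding and decode what is left.

    text = "abcdéfghij"

    u8 = bytearray(text.encode("utf-8"))
    del u8[4]                           # drop the lead byte of 'é'
    print(bytes(u8).decode("utf-8", errors="replace"))
    # 'abcd\ufffdfghij' -- one replacement character, the rest is intact

    u16 = bytearray(text.encode("utf-16-le"))
    del u16[4]                          # drop the low byte of the 'c' unit
    print(bytes(u16).decode("utf-16-le", errors="replace"))
    # 'ab' followed by junk -- every unit after the error is re-paired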
### http://www.germane-software.com/~ser jabber.com:ser ICQ:83578737
### GPG: http://www.germane-software.com/~ser/Security/ser_public.gpg