[darcs-users] UTF-16 (was: Default binary masks)
quension at mac.com
Wed Nov 26 03:04:58 UTC 2003
On Tuesday, Nov 25, 2003, at 17:47 US/Pacific, Sean E. Russell wrote:
> On Tuesday 25 November 2003 19:59, Trevor Talbot wrote:
>> UTF-16 is just as stateless and error resistant as UTF-8 within a
>> 16-bit stream -- both are stateless and error resistant on the
>> codepoint level. For example, UTF-8 is not stateless and error
>> resistant in the face of a bit stream.
> Error resistant at the codepoint level, yes. If you have a stream
> reliable at the level of 2 bytes, then UTF-16 is error resistant.
> However, UTF-16 is still not stateless. I, once again, state the
> trivial case: you need the byte order from the start of the stream.
> Since you need information from the start of the stream, UTF-16 is, by
> definition, not stateless.
This is a valid point.
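As a concrete illustration of the 2-byte-reliability point above (my sketch, not from the thread): if corruption only ever happens in whole 16-bit units, the damage to UTF-16 text stays local.

```python
# Sketch: on a transport that is reliable at the 2-byte level, UTF-16
# degrades gracefully -- losing one code unit loses one character only.
u16 = "abcdef".encode("utf-16-be")   # 2 bytes per character here (all BMP)

dropped = u16[:4] + u16[6:]          # drop the whole code unit for 'c'
assert dropped.decode("utf-16-be") == "abdef"

swapped = u16[:2] + u16[4:6] + u16[2:4] + u16[6:]  # transpose 'b' and 'c'
assert swapped.decode("utf-16-be") == "acbdef"
```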
On a practical note, if it is labeled out of band, then you know that
the stream is UTF-16BE or UTF-16LE by the same token that you know it
is UTF-8. That mitigates the requirement of information from the start
of the stream, but it does not remove the "byte order" state itself.
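To make the byte-order state concrete (my sketch, not from the thread): Python's codecs distinguish plain "utf-16", which carries the order in a BOM at the start of the stream, from the out-of-band-labeled "utf-16-be" and "utf-16-le" variants, which need no BOM.

```python
# Sketch: the byte-order "state" in UTF-16, and how out-of-band
# labeling (UTF-16BE / UTF-16LE) avoids carrying it in the stream.
text = "hi"

with_bom = text.encode("utf-16")     # native order, BOM prepended
be = text.encode("utf-16-be")        # order labeled out of band, no BOM
le = text.encode("utf-16-le")

assert with_bom[:2] in (b"\xff\xfe", b"\xfe\xff")  # BOM carries the state
assert be == b"\x00h\x00i"
assert le == b"h\x00i\x00"

# Decoding plain "utf-16" consumes the BOM; the labeled variants need none.
assert with_bom.decode("utf-16") == text
assert be.decode("utf-16-be") == text
```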
>> dealing with 16 bit codepoints. If your underlying file access is on
>> an octet basis (as it would be in most of the systems in this
>> discussion), then you read, write and move 2 octets at a time on that
> This assumes that you're starting at the front of the stream, which is
> why UTF-16 isn't stateless. If you have to "start counting" from the
> start of the stream, it isn't stateless. It is only stateless if you
> know you are dropping into the stream on a double-byte boundary, which
> is much less probable than that you're dropping in on a byte boundary.
If you know the stream is UTF-16, then you always drop into the stream
on a double-byte boundary. You never deal with UTF-16 data on a
per-byte basis. The requirements for (not) counting are exactly the
same as UTF-8 on a processing level -- the codepoint is the basic unit
of data, not the byte. Anything else is low-level implementation that
has nothing to do with Unicode processing.
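A sketch of the "drop in on a double-byte boundary" point (my example, not from the thread): once a stream is known to be UTF-16, any even byte offset is a valid code-unit boundary, and decoding can resume there without counting from the start.

```python
# Sketch: in UTF-16-BE, every even byte offset is a code-unit boundary,
# so you can seek into the stream and resume decoding without counting
# from the beginning (byte order aside).
data = "stateless".encode("utf-16-be")

tail = data[4:]                      # drop in after the first two code units
assert tail.decode("utf-16-be") == "ateless"
```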
> My point was that inserting a spurious byte (or deleting one) in the
> sequence of a UTF-8 stream will only have local effects -- it will not
> affect the interpretation of characters more than 6 bytes away in any
> direction. If a byte is added or dropped in a UTF-16 stream, the
> UTF-16 decoder will get a wrong boundary count, and the entire stream
> (after the error) will be corrupted.
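The asymmetry described above can be demonstrated directly (my sketch, not from the thread): inserting a single spurious byte damages a UTF-8 stream only locally, while the same insertion shifts every later UTF-16 code unit off its boundary.

```python
# Sketch: one spurious byte -- local damage in UTF-8, global in UTF-16.
text = "naïve text"
u8 = text.encode("utf-8")
u16 = text.encode("utf-16-be")

bad8 = u8[:3] + b"\x80" + u8[3:]     # stray continuation byte mid-stream
bad16 = u16[:3] + b"\x00" + u16[3:]  # stray byte mid-stream

# UTF-8 resynchronizes: everything past the damaged spot survives.
assert bad8.decode("utf-8", errors="replace").endswith("ve text")

# UTF-16 does not: every code unit after the error is misaligned.
assert not bad16.decode("utf-16-be", errors="replace").endswith("ve text")
```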
> However, your earlier point is valid: the same could be said about
> spurious bits in a UTF-8 stream. The significant issue is that
> UTF-16 is *less* resistant than UTF-8, because the domain of
> potential corrupting events is larger. Anything that will screw up a
> UTF-8 stream will screw up a UTF-16 stream... and there are things
> that will screw up a UTF-16 stream that will not screw up a UTF-8
> stream.