[darcs-users] UTF-16 (was: Default binary masks)

Trevor Talbot quension at mac.com
Wed Nov 26 03:04:58 UTC 2003


On Tuesday, Nov 25, 2003, at 17:47 US/Pacific, Sean E. Russell wrote:

> On Tuesday 25 November 2003 19:59, Trevor Talbot wrote:

>> UTF-16 is just as stateless and error resistant as UTF-8 within a 
>> stream of 16-bit values; both are only stateless and error 
>> resistant on the codepoint level.  For example, UTF-8 is not 
>> stateless and error resistant in the face of a bit stream.
>
> Error resistant at the codepoint level, yes.  If you have a stream 
> reliable at the level of 2 bytes, then UTF-16 is error resistant.  
> However, UTF-16 is still not stateless.  I, once again, state the 
> trivial case: you need the byte order from the start of the stream.  
> Since you need information from the start of the stream, UTF-16 is, by 
> definition, not stateless.

This is a valid point.

On a practical note, if the encoding is labeled out of band, then you 
know the stream is UTF-16BE or UTF-16LE by the same token that you know 
it is UTF-8.  That mitigates the need for information from the start of 
the stream, but it does not remove the "byte order" state itself.
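To make the byte-order point concrete, here is a quick Python sketch 
(the strings are just illustrative):

    # The same octets decode to different -- but equally valid -- text
    # depending on which byte order the decoder assumes.
    data = "hi".encode("utf-16-be")      # b'\x00h\x00i'

    print(data.decode("utf-16-be"))      # 'hi'
    print(data.decode("utf-16-le"))      # two CJK codepoints, no error raised

Without a BOM or an out-of-band label, nothing in the bytes themselves 
settles the question.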

>> dealing with 16-bit codepoints.  If your underlying file access is on 
>> an octet basis (as it would be in most of the systems in this 
>> discussion), then you read, write and move 2 octets at a time on that
>
> This assumes that you're starting at the front of the stream, which is 
> why UTF-16 isn't stateless.  If you have to "start counting" from the 
> start of the stream, it isn't stateless.  It is only stateless if you 
> know you are dropping into the stream on a double-byte boundary, which 
> is much less probable than that you're dropping in on a byte boundary.

If you know the stream is UTF-16, then you always drop into the stream 
on a double-byte boundary.  You never deal with UTF-16 data on a 
per-byte basis.  The requirements for (not) counting are exactly the 
same as for UTF-8 at the processing level -- the codepoint is the basic unit 
of data, not the byte.  Anything else is low-level implementation that 
has nothing to do with Unicode processing.
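A minimal sketch of what that looks like in practice (Python, with a 
made-up helper name -- nothing standard):

    import io
    import struct

    def utf16be_code_units(stream):
        """Yield 16-bit code units from a stream known to be UTF-16BE.

        Two octets are consumed per unit; a lone trailing octet is a
        framing error, never a partial character you silently skip.
        """
        while True:
            pair = stream.read(2)
            if not pair:
                return
            if len(pair) == 1:
                raise ValueError("odd number of octets in UTF-16 stream")
            (unit,) = struct.unpack(">H", pair)
            yield unit

    units = utf16be_code_units(io.BytesIO("A\u00e9".encode("utf-16-be")))
    print([hex(u) for u in units])       # ['0x41', '0xe9']

The byte pairing happens below the level the Unicode processing code 
ever sees; it only ever handles whole code units.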

> My point was that inserting a spurious byte (or deleting one) in the 
> sequence of a UTF-8 stream will only have local effects -- it will not 
> affect the interpretation of characters more than 6 bytes away in any 
> direction.  If a byte is added or dropped in a UTF-16 stream, the 
> UTF-16 decoder will get a wrong boundary count, and the entire stream 
> (after the error) will be corrupted.
>
> However, your earlier point is valid: the same could be said about 
> spurious bits in a UTF-8 stream.  The significant issue is that 
> UTF-16 is *less* resistant than UTF-8, because the domain of 
> potential corrupting events is larger.  Anything that will screw up a 
> UTF-8 stream will screw up a UTF-16 stream... and there are things 
> that will screw up a UTF-16 stream that will not screw up a UTF-8 
> stream.

Good point.
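The asymmetry is easy to demonstrate.  Below, the same spurious byte is 
inserted into each encoding of the same text (Python again; 
errors="replace" just makes the damage visible):

    text = "h\u00e9llo w\u00f6rld"

    u8 = bytearray(text.encode("utf-8"))
    u8.insert(3, 0xFF)                   # one stray byte
    print(bytes(u8).decode("utf-8", errors="replace"))
    # 'hé\ufffdllo wörld' -- local damage only; the decoder resynchronizes

    u16 = bytearray(text.encode("utf-16-be"))
    u16.insert(3, 0xFF)                  # the same stray byte
    print(bytes(u16).decode("utf-16-be", errors="replace"))
    # every code unit after the insertion is shifted by one octet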




