[darcs-users] UTF-16 (was: Default binary masks)

Tue Nov 25 17:21:51 UTC 2003

Sean E Russell wrote:
> On Tuesday 25 November 2003 02:56, Trevor Talbot wrote:
>>within the 16bit unit.  As a result of the simple surrogate encoding,
>>you need to scan a maximum of 1 codepoint in either direction to
>>retrieve the full character you landed on, or a maximum of 3 codepoints
>>to retrieve the next or previous characters.  Compare to UTF-8 scanning.
> 
> ... this is not true.  Since (a) the number of bytes per character is 
> variable, and (b) the byte sex is variable, and (c) subsequences are 
> ambiguous, UTF-16 is *not* random access, and you can't correct it by 
> scanning in either direction.  You must scan the entire UTF-16 stream up to a 
> character to be able to decode that character.

As much as I dislike UTF-16 files, it appears from reading the 
standard[1] that you can, in fact, scan forward or back by at most one 
16-bit unit to find the start of a character. As with UTF-8, they 
designed it such that the first unit of a two-unit (surrogate) sequence 
falls within a unique range (0xD800-0xDBFF), and the second unit is 
likewise identifiable as such (0xDC00-0xDFFF).
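
To make the scan concrete, here is a minimal sketch in Python. It 
assumes the stream has already been split into 16-bit code units with 
the byte order resolved (so it says nothing about the byte-order 
complaint), and that the input is well formed. The function names are 
illustrative only; they do not come from the RFC or from darcs.

```python
def is_high_surrogate(unit):
    # First unit of a surrogate pair, per RFC 2781.
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    # Second unit of a surrogate pair, per RFC 2781.
    return 0xDC00 <= unit <= 0xDFFF


def char_start(units, i):
    """Return the index of the first code unit of the character at units[i]."""
    # Only a low surrogate can be the middle of a character, and its
    # partner is always the unit immediately before it.
    if is_low_surrogate(units[i]) and i > 0 and is_high_surrogate(units[i - 1]):
        return i - 1
    return i


def decode_at(units, i):
    """Decode the full character that code unit index i falls inside."""
    start = char_start(units, i)
    first = units[start]
    if (is_high_surrogate(first)
            and start + 1 < len(units)
            and is_low_surrogate(units[start + 1])):
        # Combine the two 10-bit halves and add the 0x10000 offset.
        return 0x10000 + ((first - 0xD800) << 10) + (units[start + 1] - 0xDC00)
    # Single-unit character (an unpaired surrogate is returned unchanged).
    return first


# 'A', U+1D11E (encoded as the pair D834 DD1E), 'B'
units = [0x0041, 0xD834, 0xDD1E, 0x0042]
```

Landing on index 2 (the low surrogate) requires looking back exactly 
one unit: char_start(units, 2) gives 1, and decode_at(units, 2) gives 
0x1D11E. No scan from the beginning of the stream is involved.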

I agree with your other concerns/complaints about UTF-16.

Although I personally find this topic interesting, and it is marginally 
relevant to darcs, I suspect we would be better off postponing most of 
this discussion until some darcs Unicode extensions are imminent.

Kevin

[1] http://www.faqs.org/rfcs/rfc2781.html