[darcs-users] UTF-16 (was: Default binary masks)
Kevin Smith
yarcs at qualitycode.com
Tue Nov 25 17:21:51 UTC 2003
Sean E Russell wrote:
> On Tuesday 25 November 2003 02:56, Trevor Talbot wrote:
>>within the 16bit unit. As a result of the simple surrogate encoding,
>>you need to scan a maximum of 1 codepoint in either direction to
>>retrieve the full character you landed on, or a maximum of 3 codepoints
>>to retrieve the next or previous characters. Compare to UTF-8 scanning.
>
> ... this is not true. Since (a) the number of bytes per character is
> variable, and (b) the byte sex is variable, and (c) subsequences are
> ambiguous, UTF-16 is *not* random access, and you can't correct it by
> scanning in either direction. You must scan the entire UTF-16 stream up to a
> character to be able to decode that character.
As much as I dislike UTF-16 files, it appears from reading the
standard[1] that you can, in fact, scan forward or back by at most one
16-bit unit to find the start of a character. As with UTF-8, it was
designed so that the first part of a multi-part sequence falls within a
unique range (0xD800-0xDBFF), and the second part is likewise
identifiable as such (0xDC00-0xDFFF).
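
To make that concrete, here is a minimal Python sketch of the resync
logic as I read RFC 2781. The function names are my own and purely
illustrative, and it assumes the byte order is already known and the
stream has already been split into 16-bit units (so it does not address
the byte-order concern at all).

```python
# Sketch only: resynchronising inside a sequence of UTF-16 code units.
# Assumes `units` is a list of ints in 0..0xFFFF, already byte-swapped
# into native order. All helper names here are hypothetical.

def is_high_surrogate(unit):
    # First half of a surrogate pair.
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit):
    # Second half of a surrogate pair.
    return 0xDC00 <= unit <= 0xDFFF

def char_start(units, i):
    """Index of the first unit of the character that units[i] belongs to."""
    # Only a low surrogate preceded by a high surrogate needs a step back.
    if is_low_surrogate(units[i]) and i > 0 and is_high_surrogate(units[i - 1]):
        return i - 1
    return i

def decode_at(units, i):
    """Code point of the character containing units[i]."""
    start = char_start(units, i)
    first = units[start]
    if (is_high_surrogate(first)
            and start + 1 < len(units)
            and is_low_surrogate(units[start + 1])):
        # Combine the two 10-bit payloads and add the 0x10000 offset.
        return 0x10000 + ((first - 0xD800) << 10) + (units[start + 1] - 0xDC00)
    # Single-unit character; an unpaired surrogate is returned unchanged.
    return first
```

Landing on any unit, char_start looks back at most one unit and
decode_at looks forward at most one, which is the "just a little bit"
above.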
I agree with your other concerns/complaints about UTF-16.
Although I personally find this topic interesting, and it is marginally
relevant to darcs, I suspect we would be better off postponing most of
this discussion until some darcs Unicode extensions are imminent.
Kevin
[1] http://www.faqs.org/rfcs/rfc2781.html