[darcs-users] UTF-16 (was: Default binary masks)

Sean E Russell ser at germane-software.com
Tue Nov 25 20:43:03 UTC 2003

On Tuesday 25 November 2003 12:21, Kevin Smith wrote:
> As much as I dislike UTF-16 files, it appears from reading the
> standard[1] that you can, in fact, scan forward or back just a little
> bit to find the start of a character. As with UTF-8, they designed it
> such that the first part of a multi-part sequence falls within a unique
> range. And, subsequent parts of a multi-part sequence are also
> identifiable as such.

Only on a two-byte boundary.  The two bytes that make up the unique value are 
such that the first byte is a legal second byte of another UTF-16 character, 
and the second byte is a legal first byte of another UTF-16 character.

UTF-16 4-byte characters have the first (high) two bytes in the range 0xD800 - 
0xDBFF and the second (low) two bytes in the range 0xDC00 - 0xDFFF.  So, using 
LSB byte sex (which I had reversed in my previous email), DC 00 D8 00 is a 
UTF-16 character 4 byte sequence describing a single character.  But 00 DC 00 
D8 00 FF is also a legal UTF-16 byte sequence, describing 3 characters.  In 
the latter case, given the entry point at D8, you would scan backward and see 
DC 00, and decide that D8 was the MSB of a 4-byte sequence, decoding it as 
(0xDC00D800)... and you'd be wrong.
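
A minimal sketch of the alignment problem, in Python.  The byte values 
are chosen for illustration and assume big-endian order, so they are not 
the exact sequence above; the helper names are hypothetical.

```python
# Big-endian UTF-16 for U+00D8, U+00DC, U+00FF: three 2-byte characters.
data = bytes([0x00, 0xD8, 0x00, 0xDC, 0x00, 0xFF])


def code_units(buf, offset):
    """Read big-endian 16-bit code units starting at byte `offset`."""
    return [int.from_bytes(buf[i:i + 2], "big")
            for i in range(offset, len(buf) - 1, 2)]


def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF


# Entering on the real two-byte boundary: no surrogates at all.
aligned = code_units(data, 0)

# Entering one byte late: the same bytes now look like a surrogate pair.
misaligned = code_units(data, 1)
looks_like_pair = (is_high_surrogate(misaligned[0])
                   and is_low_surrogate(misaligned[1]))
```

Here `aligned` holds three ordinary code units, while `misaligned` holds 
what appears to be a well-formed high/low pair.  Nothing in the bytes 
themselves says which reading is the right one.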

You have to somehow know where the 2-byte boundary is to avoid this.  And you (of 
course) also have to know the byte sex, which information is located at the 
start of the stream.  And you have to be sure that the stream hasn't had any 
bytes inserted or deleted, which would skew your byte boundary.
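
As a rough sketch of the byte-order dependency, assuming the stream 
begins with a byte order mark (not every UTF-16 stream has one), the 
following Python checks the first two bytes.  The function name is 
hypothetical; the `codecs` constants are from the standard library.

```python
import codecs


def detect_utf16_byte_order(stream_start):
    """Return "big", "little", or None if no UTF-16 BOM is present."""
    if stream_start.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "big"
    if stream_start.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "little"
    return None


# The same two bytes mean different things under each byte order.
pair = bytes([0x00, 0xD8])
as_big = int.from_bytes(pair, "big")        # 0x00D8, an ordinary character
as_little = int.from_bytes(pair, "little")  # 0xD800, a high surrogate
```

Without the start of the stream, `detect_utf16_byte_order` has nothing 
to look at, and a reader dropped into the middle has to guess.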

> Although I personally find this topic interesting, and it is marginally
> relevant to darcs, I suspect we would be better off postponing most of
> this discussion until some darcs unicode extensions are imminent.

Agreed.  I'll resist the temptation to debate what I think are 
misunderstandings about the spec.  That said, I got the impression that David 
was seriously considering adding encoding support to darcs.  The velocity at 
which David works has taken me by surprise before, so I try not to 
underestimate him any more.

--- SER
