[darcs-users] UTF-16 (was: Default binary masks)
Sean E Russell
ser at germane-software.com
Tue Nov 25 20:43:03 UTC 2003
On Tuesday 25 November 2003 12:21, Kevin Smith wrote:
> As much as I dislike UTF-16 files, it appears from reading the
> standard that you can, in fact, scan forward or back just a little
> bit to find the start of a character. As with UTF-8, they designed it
> such that the first part of a multi-part sequence falls within a unique
> range. And, subsequent parts of a multi-part sequence are also
> identifiable as such.
Only on a two-byte boundary. The two bytes that make up the unique value are
such that the first byte is a legal second byte of another UTF-16 character,
and the second byte is a legal first byte of another UTF-16 character.
UTF-16 4-byte characters have the first (high) two bytes in the range 0xD800 -
0xDBFF and the second (low) two bytes in the range 0xDC00 - 0xDFFF. So, using
LSB byte sex (which I had reversed in my previous email), DC 00 D8 00 is a
UTF-16 character 4 byte sequence describing a single character. But 00 DC 00
D8 00 FF is also a legal UTF-16 byte sequence, describing 3 characters. In
the latter case, given the entry point at D8, you would scan backward and see
DC 00, and decide that D8 was the MSB of a 4-byte sequence, decoding it as
(0xDC00D800)... and you'd be wrong.
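To make the alignment hazard concrete, here is a minimal Python sketch. It assumes big-endian byte order and uses byte values picked for this illustration (00 D8 00 DC 00 FF), not the exact bytes quoted above; the helper names units_be and has_surrogate_pair are hypothetical. Read from offset 0 the stream is three ordinary characters; read from offset 1 the same bytes look like a well-formed surrogate pair.

```python
# Surrogate ranges from the UTF-16 definition.
HIGH = range(0xD800, 0xDC00)  # first (high) 16-bit unit of a 4-byte character
LOW = range(0xDC00, 0xE000)   # second (low) 16-bit unit of a 4-byte character


def units_be(data, offset=0):
    """Read big-endian 16-bit units starting at byte `offset` (hypothetical helper)."""
    usable = (len(data) - offset) // 2 * 2  # drop a trailing odd byte
    return [
        int.from_bytes(data[i:i + 2], "big")
        for i in range(offset, offset + usable, 2)
    ]


def has_surrogate_pair(units):
    """True if some unit in HIGH is immediately followed by a unit in LOW."""
    return any(a in HIGH and b in LOW for a, b in zip(units, units[1:]))


# Three BMP characters (U+00D8, U+00DC, U+00FF): bytes 00 D8 00 DC 00 FF.
stream = "\u00d8\u00dc\u00ff".encode("utf-16-be")

aligned = units_be(stream, 0)     # correct two-byte boundary
misaligned = units_be(stream, 1)  # boundary skewed by one byte

print([hex(u) for u in aligned], has_surrogate_pair(aligned))
print([hex(u) for u in misaligned], has_surrogate_pair(misaligned))
```

Nothing in the misaligned units marks them as wrong: each falls in its expected range, so a scanner that does not already know the two-byte boundary has no local way to reject them.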
You have to somehow know where the 2-byte boundary is to avoid this. And you (of
course) also have to know the byte sex, which information is located at the
start of the stream. And you have to be sure that the stream hasn't had any
bytes inserted or deleted, which would skew your byte boundary.
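As a rough sketch of the byte-order point, the Python below checks for a byte order mark using the BOM constants from the standard codecs module. The function name detect_utf16_byte_order is hypothetical, and the sketch assumes the data is already known to be UTF-16. A slice taken from the middle of a stream carries no mark, so the answer there is simply unknown.

```python
import codecs


def detect_utf16_byte_order(data):
    """Return "big", "little", or None if no UTF-16 BOM starts `data`.

    Hypothetical helper: it only inspects the first two bytes and assumes
    the caller already knows the data is UTF-16.
    """
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "little"
    return None


whole = codecs.BOM_UTF16_LE + "darcs".encode("utf-16-le")
middle = whole[4:]  # a chunk from inside the stream, BOM no longer visible

print(detect_utf16_byte_order(whole))   # BOM present at the start
print(detect_utf16_byte_order(middle))  # no BOM, byte order unknown
```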
> Although I personally find this topic interesting, and it is marginally
> relevant to darcs, I suspect we would be better off postponing most of
> this discussion until some darcs unicode extensions are imminent.
Agreed. I'll resist the temptation to debate what I think are
misunderstandings about the spec. That said, I got the impression that David
was seriously considering adding encoding support to darcs. The velocity at
which David works has taken me by surprise before, so I try not to
underestimate him any more.