[darcs-users] UTF-16 (was: Default binary masks)

David Roundy droundy at jdj5.mit.edu
Thu Nov 27 11:49:30 UTC 2003


On Thu, Nov 27, 2003 at 11:03:36AM +0900, Alex Shinn wrote:
> At Wed, 26 Nov 2003 08:33:16 -0500, David Roundy wrote:
> > 
> > Fortunately UTF-8 should be no problem, except that replace may behave
> > interestingly, since it works on a byte-by-byte basis--so you shouldn't use
> > replace with multibyte characters.
> 
> The first byte of a UTF-8 character determines the byte length, and
> starting bytes won't appear as the successive bytes, so if you match the
> byte representation of a UTF-8 character you're guaranteed to really
> have matched that character.  So a string search at the byte level is
> equivalent to a search at the character level, and thus replace is
> perfectly safe with UTF-8[1].
> 
> Unless you're doing linguistic processing (including text editing and
> font rendering) you almost never need to treat UTF-8 differently from
> plain ASCII.

It seems that replace is safe if your tokens contain only single-byte
characters.  If you want to include multibyte characters in your tokens,
though, the tokenizing code will split the bytes of a single character,
which is definitely not a good thing.  A small extension to the tokenizing
code to support [^ \n\t]-style specifications would make replace work
properly with multibyte characters, as long as you specify the token
delimiters rather than the valid token characters (or the valid token
characters are all single-byte).
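
To illustrate the difference, here is a minimal sketch in Python (not
darcs's own Haskell, and not its actual tokenizer).  The function name
replace_token and its token_class parameter are hypothetical; the sketch
only assumes that tokens are maximal runs of bytes matching a character
class, and that a token is replaced only when it equals the old token
exactly.

```python
import re


def replace_token(data: bytes, old: bytes, new: bytes, token_class: bytes) -> bytes:
    """Replace whole tokens equal to `old` with `new`, working on raw bytes.

    `token_class` is a regex character class over bytes: either the valid
    token bytes, e.g. rb"[A-Za-z_0-9]", or everything except the
    delimiters, e.g. rb"[^ \n\t]".  (Hypothetical helper, for illustration.)
    """
    # A token is a maximal run of bytes that belong to the class.
    pattern = re.compile(token_class + b"+")
    # Swap a token only when it matches `old` in full; leave others alone.
    return pattern.sub(lambda m: new if m.group(0) == old else m.group(0), data)


text = "naïve café naïveté\n".encode("utf-8")
old = "naïve".encode("utf-8")

# Valid-character class that lists only single-byte characters: the two
# bytes of "ï" fall outside the class, so "naïve" is tokenized as "na" and
# "ve" and the replacement never fires.
print(replace_token(text, old, b"simple", rb"[A-Za-z_0-9]").decode("utf-8"), end="")

# Delimiter-based class: every non-delimiter byte is a token byte, so the
# multibyte character stays inside one token and the replacement works.
print(replace_token(text, old, b"simple", rb"[^ \n\t]").decode("utf-8"), end="")
```

With the delimiter-based class the multibyte bytes never need to be named
in the class at all, which is why specifying delimiters sidesteps the
problem.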
-- 
David Roundy
http://www.abridgegame.org



