[darcs-users] UTF-16 (was: Default binary masks)

Alex Shinn foof at synthcode.com
Thu Nov 27 02:03:36 UTC 2003


At Wed, 26 Nov 2003 08:33:16 -0500, David Roundy wrote:
> 
> Fortunately UTF-8 should be no problem, except that replace may behave
> interestingly, since it works on a byte-by-byte basis--so you shouldn't use
> replace with multibyte characters.

In UTF-8 the first byte of a character determines its byte length, and
lead bytes never appear as continuation bytes.  So if you match the
complete byte representation of a UTF-8 character, the match starts and
ends on a character boundary and you're guaranteed to really have
matched that character.  A string search at the byte level is therefore
equivalent to a search at the character level, and thus replace is
perfectly safe with UTF-8[1].
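
To make the argument above concrete, here is a minimal sketch in Python
(not darcs code; the helper names replace_bytes and is_continuation are
made up for illustration).  It does a replace on the raw UTF-8 bytes and
checks that the result agrees with a character-level replace:

```python
# Hypothetical helpers for illustration only; nothing here is darcs code.

def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def replace_bytes(text: str, old: str, new: str) -> str:
    """Replace on the encoded bytes, as a byte-oriented tool would."""
    data = text.encode("utf-8")
    replaced = data.replace(old.encode("utf-8"), new.encode("utf-8"))
    # Decoding succeeds because matches only occur on character boundaries.
    return replaced.decode("utf-8")


sample = "caf\u00e9 au lait, \u65e5\u672c\u8a9e"
encoded_e_acute = "\u00e9".encode("utf-8")  # two bytes: 0xC3 0xA9

# The lead byte is not a continuation byte; the second byte is.
lead_is_continuation = is_continuation(encoded_e_acute[0])
tail_is_continuation = is_continuation(encoded_e_acute[1])

byte_result = replace_bytes(sample, "\u00e9", "e")
char_result = sample.replace("\u00e9", "e")
print(byte_result == char_result)
```

The two results agree because a match on the bytes of a whole character
can only begin at a lead byte, and lead bytes only occur at the start of
a character.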

Unless you're doing linguistic processing (including text editing and
font rendering) you almost never need to treat UTF-8 differently from
plain ASCII.


Footnotes: 
[1]  assuming you stick to the same normalization form (if not, there's
     no corruption, but you may miss some matches)
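
The caveat in [1] can be sketched with Python's standard unicodedata
module (an illustration only, assuming the pattern is in NFC and the
text is in NFD).  The search misses the decomposed form, but the text
is left untouched rather than corrupted:

```python
import unicodedata

# The same word in two normalization forms.
composed = unicodedata.normalize("NFC", "caf\u00e9")    # e-acute as one code point
decomposed = unicodedata.normalize("NFD", "caf\u00e9")  # "e" plus combining acute

needle = "\u00e9".encode("utf-8")  # pattern in composed (NFC) form

found_in_composed = needle in composed.encode("utf-8")
found_in_decomposed = needle in decomposed.encode("utf-8")

# A byte-level replace on the decomposed text finds nothing to change.
after = decomposed.encode("utf-8").replace(needle, b"e").decode("utf-8")
unchanged = after == decomposed
print(found_in_composed, found_in_decomposed, unchanged)
```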

-- 
Alex



