[darcs-users] UTF-16 (was: Default binary masks)

Sean E. Russell ser at germane-software.com
Sat Nov 29 18:20:42 UTC 2003


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Thursday 27 November 2003 06:49, David Roundy wrote:
> > Unless you're doing linguistic processing (including text editing and
> > font rendering) you almost never need to treat UTF-8 differently from
> > plain ASCII.
>
> It seems that replace is safe if your tokens only contain single byte
> characters, but if you want to include multibyte characters in your tokens,
> the tokenizing code will split the bytes of a single character, which is
> definitely not a good thing.  A small extension to the tokenizing code to
> support [^ \n\t] type specification would make replace work all right with
> multibyte characters, as long as you specify the token delimiters rather
> than the valid token characters (or the valid token characters are all
> single-byte).

I'm not sure what you're saying, so I could be waaay wrong about this, but:

The rules of UTF-8 state that the first 128 code points (U+0000 to U+007F) map
1-1 to 7-bit ASCII, and that none of those 128 byte values can appear as a byte
in the encoding of any other character.
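
A quick way to check that property is the sketch below. It is an illustration in Python, not anything taken from darcs; the function name and the sample string are made up for the example. It encodes each non-ASCII character and collects any byte below 0x80 found in its encoding.

```python
def ascii_bytes_in_multibyte(text: str) -> list[int]:
    """Return every byte < 0x80 found inside the UTF-8 encoding of a
    non-ASCII character in `text` (hypothetical helper for illustration)."""
    offenders = []
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1:  # multibyte character
            # Lead bytes are 0xC2..0xF4 and continuation bytes are 0x80..0xBF,
            # so nothing here should ever fall in the ASCII range.
            offenders.extend(b for b in encoded if b < 0x80)
    return offenders


sample = "naïve café 日本語 €"
print(ascii_bytes_in_multibyte(sample))  # prints []
```

Because every byte of a multibyte character has its high bit set, code that splits on ASCII delimiters such as space, tab or newline cannot cut a character in half.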

So, am I understanding you correctly: replace doesn't replace a List (as in,
order is significant, and duplicates are allowed) of bytes with another List
of bytes?  Is this a 'tr'-style replace?
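
To make the distinction in that question concrete, here is a small Python sketch. It does not model how darcs implements replace; both function names are invented for the example. The first function replaces an ordered sequence of bytes as a unit, and the second maps each byte independently, the way 'tr' does.

```python
def sequence_replace(data: bytes, old: str, new: str) -> bytes:
    """Replace the whole UTF-8 byte sequence of `old` with that of `new`."""
    return data.replace(old.encode("utf-8"), new.encode("utf-8"))


def tr_replace(data: bytes, old_bytes: bytes, new_bytes: bytes) -> bytes:
    """Map each byte in `old_bytes` to the byte at the same index in
    `new_bytes`, independently of its neighbours (like 'tr')."""
    return data.translate(bytes.maketrans(old_bytes, new_bytes))


text = "café crème".encode("utf-8")

# Sequence replace treats the two bytes of 'é' (0xC3 0xA9) as a unit.
print(sequence_replace(text, "é", "e").decode("utf-8"))  # prints cafe crème

# A tr-style mapping rewrites 0xC3 wherever it occurs, including the lead
# byte of 'è' (0xC3 0xA8), leaving a stray continuation byte behind.
damaged = tr_replace(text, b"\xc3\xa9", b"ee")
print(damaged)  # prints b'cafee cre\xa8me'
```

The sequence form stays valid UTF-8 because a complete encoded character can only match at a character boundary. The byte-by-byte form produces output that no longer decodes as UTF-8.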

- -- 
### SER   
### Deutsch|Esperanto|Francaise|Linux|XML|Java|Ruby|Aikido|Dirigibles
### http://www.germane-software.com/~ser  jabber.com:ser  ICQ:83578737 
### GPG: http://www.germane-software.com/~ser/Security/ser_public.gpg
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQE/yON6P0KxygnleI8RAhEYAJwLtmSSBVm+SGvcrDkcBmI4M1R+jgCcDk6A
izANlBboJ2EQBHgpYRK7Xdc=
=yddj
-----END PGP SIGNATURE-----




