[darcs-users] UTF-16 (was: Default binary masks)
Sean E. Russell
ser at germane-software.com
Sat Nov 29 18:20:42 UTC 2003
On Thursday 27 November 2003 06:49, David Roundy wrote:
> > Unless you're doing linguistic processing (including text editing and
> > font rendering) you almost never need to treat UTF-8 differently from
> > plain ASCII.
>
> It seems that replace is safe if your tokens only contain single byte
> characters, but if you want to include multibyte characters in your tokens,
> the tokenizing code will split the bytes of a single character, which is
> definitely not a good thing. A small extension to the tokenizing code to
> support [^ \n\t] type specification would make replace work all right with
> multibyte characters, as long as you specify the token delimiters rather
> than the valid token characters (or the valid token characters are all
> single-byte).
I'm not sure what you're saying, so I could be waaay wrong about this, but:
the rules of UTF-8 state that the first 128 characters (U+0000 through U+007F)
map 1-1 to 7-bit ASCII, and that none of those 128 byte values can appear as a
byte in the encoding of any other character.
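A quick way to convince yourself of that property is the minimal Python sketch below. The function name and the sample string are made up for illustration and have nothing to do with the darcs code itself. It encodes each non-ASCII character on its own and collects any byte below 0x80 that turns up inside a multibyte sequence.

```python
def ascii_bytes_in_multibyte(text):
    """Collect bytes below 0x80 that occur inside multibyte UTF-8 sequences.

    For well-formed UTF-8 this list should always come back empty.
    """
    offenders = []
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1:  # only look at non-ASCII characters
            offenders.extend(b for b in encoded if b < 0x80)
    return offenders


# Hypothetical sample mixing 1-, 2-, and 3-byte characters.
sample = "naïve café 日本語 €"
print(ascii_bytes_in_multibyte(sample))

# Pure ASCII text has the same bytes in both encodings.
print("a b\tc".encode("utf-8") == "a b\tc".encode("ascii"))

# Splitting the raw bytes on an ASCII delimiter never cuts a character in half.
print("é ü".encode("utf-8").split(b" "))
```

The last line is the relevant case for tokenizing: as long as the delimiters are ASCII bytes, splitting the raw byte stream leaves every multibyte character intact.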
So, do I understand you correctly that replace doesn't substitute one List (as
in, order is significant, and duplicates are allowed) of bytes for another
List of bytes? Is this a 'tr'-style replace?
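To make the distinction I'm asking about concrete, here is a small Python sketch. Both helper names are hypothetical, and I am not claiming that either one is what darcs replace actually does. The first swaps an ordered byte sequence for another; the second maps individual bytes the way 'tr' does.

```python
def sequence_replace(data, old, new):
    """Replace every occurrence of the byte sequence `old` with `new`."""
    return data.replace(old, new)


def tr_replace(data, from_bytes, to_bytes):
    """Map each single byte in `from_bytes` to the byte at the same
    position in `to_bytes`, the way 'tr' works on individual bytes."""
    return data.translate(bytes.maketrans(from_bytes, to_bytes))


text = "café".encode("utf-8")  # b'caf\xc3\xa9'

# List-of-bytes replace: the two-byte 'é' is swapped out as a unit.
print(sequence_replace(text, "é".encode("utf-8"), b"e"))

# Per-byte replace: only the second byte of 'é' is touched, which
# leaves a lone lead byte and therefore invalid UTF-8.
broken = tr_replace(text, b"\xa9", b"e")
print(broken)
try:
    broken.decode("utf-8")
except UnicodeDecodeError as err:
    print("no longer valid UTF-8:", err.reason)
```

If replace works like the first function, multibyte tokens should be safe to substitute; only the second style can split a character.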
--
### SER
### Deutsch|Esperanto|Francaise|Linux|XML|Java|Ruby|Aikido|Dirigibles
### http://www.germane-software.com/~ser jabber.com:ser ICQ:83578737
### GPG: http://www.germane-software.com/~ser/Security/ser_public.gpg