[darcs-users] UTF-16 (was: Default binary masks)

Sun Nov 30 17:36:13 UTC 2003

On Sat, Nov 29, 2003 at 01:20:42PM -0500, Sean E. Russell wrote:
> On Thursday 27 November 2003 06:49, David Roundy wrote:
> > > Unless you're doing linguistic processing (including text editing and
> > > font rendering) you almost never need to treat UTF-8 differently from
> > > plain ASCII.
> >
> > It seems that replace is safe if your tokens only contain single byte
> > characters, but if you want to include multibyte characters in your tokens,
> > the tokenizing code will split the bytes of a single character, which is
> > definitely not a good thing.  A small extension to the tokenizing code to
> > support [^ \n\t] type specification would make replace work all right with
> > multibyte characters, as long as you specify the token delimiters rather
> > than the valid token characters (or the valid token characters are all
> > single-byte).
> 
> I'm not sure what you're saying, so I could be waaay wrong about this,
> but: the rules of UTF-8 state that the first 127 characters map 1-1 to
> 7-bit ASCII, and that none of those 127 characters can appear as a byte
> in any other character.
> 
> So, I'm understanding you to be saying that replace doesn't replace a
> List (as in, order is significant, and duplicates are allowed) of bytes
> with another List of bytes?  Is this a 'tr' replace?

It does do a list replacement with another list...  Hmmmm.  You're right
that there shouldn't be a real problem in terms of data corruption.
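
To convince myself, here is a minimal sketch in Python (not darcs code; the
sample string and variable names are made up for illustration) of why a
byte-list replacement of an ASCII token can't corrupt UTF-8 text: every byte
of a multibyte character has its high bit set, so an ASCII byte sequence can
never match in the middle of one.

```python
# Sample text with two multibyte characters (i-diaeresis and e-acute).
text = "na\u00efve abc caf\u00e9"
data = text.encode("utf-8")

# Collect the bytes that belong to non-ASCII characters.
multibyte_bytes = [b for ch in text if ord(ch) > 127 for b in ch.encode("utf-8")]

# In UTF-8, all of them are >= 0x80, so none can be mistaken for ASCII.
all_high = all(b >= 0x80 for b in multibyte_bytes)

# Replace one list of bytes with another, as replace does.
replaced = data.replace(b"abc", b"bcd")

# The result is still valid UTF-8 and the multibyte characters are untouched.
result = replaced.decode("utf-8")
```

The decode at the end would raise UnicodeDecodeError if the replacement had
split a character.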

The only problem, now that I think about it, is that if you allowed
tokens containing multibyte characters, you could end up missing some
replacements.  Using letters to represent 7-bit ASCII characters and digits
to represent other bytes (so '21' is one two-byte character), we'd have the
problem that if you defined your tokens as containing:

['a' 'b' 'c' 'd' '21' '22']

then "fg731abcf" would be thought to contain the token "1abc", rather than
the correct token "abc", since tokens are defined in terms of which bytes
are allowed, and the byte '1' (here the second byte of the character '31')
is an allowed byte.  This would mean that a darcs replace "abc" "bcd" would
miss this occurrence of "abc".
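
The example can be sketched in Python using the same notation (letters are
ASCII bytes, digits are bytes of multibyte characters).  This is only an
illustration of byte-set tokenizing as described above, not the actual darcs
tokenizer; runs and replace_token are hypothetical names.

```python
from itertools import groupby

def runs(data, allowed):
    """Split data into maximal runs of allowed and non-allowed bytes."""
    return [(is_token, "".join(group))
            for is_token, group in groupby(data, key=lambda b: b in allowed)]

def replace_token(data, allowed, old, new):
    """Replace whole tokens equal to old; everything else is left alone."""
    return "".join(new if is_token and run == old else run
                   for is_token, run in runs(data, allowed))

# Token characters 'a' 'b' 'c' 'd' '21' '22', flattened to their bytes.
allowed = set("abcd") | set("21") | set("22")

# The byte '1' from the character '31' gets glued onto "abc".
found = [run for is_token, run in runs("fg731abcf", allowed) if is_token]

# So the replacement misses this occurrence...
missed = replace_token("fg731abcf", allowed, "abc", "bcd")

# ...but works when "abc" is bounded by bytes outside the token set.
hit = replace_token("fg abc f", allowed, "abc", "bcd")
```

Here found is ["1abc"], missed is the unchanged "fg731abcf", and hit is
"fg bcd f".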
-- 
David Roundy
http://www.abridgegame.org