[darcs-users] UTF-16 (was: Default binary masks)

Sean E. Russell ser at germane-software.com
Sun Nov 30 20:06:00 UTC 2003



Ok, caveat time:

I *know* that I don't understand the entire darcs side of the problem yet, so 
there's great potential for talking at cross-purposes here.  However, this 
has never stopped me before, so I'm forging blindly ahead.

On Sunday 30 November 2003 12:36, David Roundy wrote:
> > So, I'm understanding you to be saying that replace doesn't replace a
> > List (as in, order is significant, and duplicates are allowed) of bytes
> > with another List of bytes?  Is this a 'tr' replace?
>
> It does do a list replacement with another list...  Hmmmm.  You're right
> that there shouldn't be a real problem in terms of data corruption.
>
> The only problem now that I think about it would be that if you allowed
> tokens containing multibyte characters, you'd end up missing some
> replacements possibly.  Using alphanumerics to represent 7-bit ASCII and
> numerals for other bytes, we'd
> have the problem that if you defined your tokens as containing:
>
> ['a' 'b' 'c' 'd' '21' '22']
>
> then "fg731abcf" would be thought to contain the token "1abc", rather than
> the correct token "abc", since tokens are defined in terms of which bytes
> are allowed.  This would mean that a darcs replace "abc" "bcd" would miss
> this occurrence of "abc".

Allow me to restate the example, with different values:

A UTF-8 string would look something like this, as a List of characters 
(omitting the apostrophes):

   [ a b c <omega> d e f ]

This is a string of 7 characters, which in UTF-8 is a list of 8 bytes:

   [ 0x61 0x62 0x63 0xCE 0xA9 0x64 0x65 0x66 ]

0xCE demarcates the start of a multi-byte character of 2 bytes (the Greek 
Omega character).  The following byte will not be a 7-bit ASCII character.

I think this second List is how darcs sees the string -- is this correct?  Just 
a sequence of bytes, in which there are two illegal values: EOF and 0x00.  
Both fall in the 7-bit ASCII range, and therefore won't occur inside a 
UTF-8 multi-byte character (all bytes in a UTF-8 multi-byte sequence have 
their high bit set) [A].  Furthermore, the byte that starts a multi-byte 
sequence is in the range 0xC0-0xFD, and those bytes are not allowed as the 
second or later byte of a multi-byte sequence [B].  I.e., the first byte 
is 0xC0-0xFD, and the following bytes are 0x80-0xBF.
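For concreteness, rules [A] and [B] can be spelled out as byte predicates.  A 
minimal Haskell sketch (the names are mine, not anything in darcs; I'm using 
the original byte ranges, with lead bytes up to 0xFD for the old 6-byte 
sequences):

```haskell
import Data.Word (Word8)

-- Predicates for rules [A] and [B] above.  Illustration only.
isAscii7 :: Word8 -> Bool
isAscii7 b = b <= 0x7F                     -- high bit clear: 7-bit ASCII

isLead :: Word8 -> Bool
isLead b = b >= 0xC0 && b <= 0xFD          -- starts a multi-byte sequence

isContinuation :: Word8 -> Bool
isContinuation b = b >= 0x80 && b <= 0xBF  -- second or later byte only
```

Note that the three ranges are disjoint and cover every byte value except 
0xFE/0xFF, which never appear in UTF-8 at all.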

So, back to the example.  

   [ f g <o-umlaut> a b c f ]

would look to darcs like the byte list:

   [ 0x66 0x67 0xC3 0xB6 0x61 0x62 0x63 0x66 ]

So

   replace "abc" "bcd"

would be:

   replace [ 0x61 0x62 0x63 ] [ 0x62 0x63 0x64 ]

The sequence of bytes "abc" is unique in the string.
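Taking darcs' byte-level view literally, the substitution above can be 
sketched as a plain list-of-bytes replace (this is just my illustration, not 
darcs' actual implementation):

```haskell
-- A naive byte-sequence replace over a list of bytes, as I picture
-- darcs' view of the file.  Sketch only: replaceBytes is my name, and
-- a real implementation would not rescan with take/drop like this.
replaceBytes :: Eq a => [a] -> [a] -> [a] -> [a]
replaceBytes _ _ [] = []
replaceBytes pat rep s@(x:xs)
  | take n s == pat = rep ++ replaceBytes pat rep (drop n s)
  | otherwise       = x : replaceBytes pat rep xs
  where n = length pat

-- The example above: replacing "abc" with "bcd" leaves the o-umlaut
-- bytes (0xC3 0xB6) untouched:
--   replaceBytes [0x61,0x62,0x63] [0x62,0x63,0x64]
--                [0x66,0x67,0xC3,0xB6,0x61,0x62,0x63,0x66]
--     == [0x66,0x67,0xC3,0xB6,0x62,0x63,0x64,0x66]
```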

If all three strings -- the text being modified, the replacement string, and 
the pattern -- are valid UTF-8, then I don't see the problem.  The problem 
arises if any of them is not valid UTF-8; for example, if the string has been 
truncated at the beginning or the end in the middle of a multi-byte 
character.

If darcs replace() will accept 8-bit bytes, and if replace() is guaranteed to 
enter the string on a character boundary, then replace() should be safe for 
UTF-8 strings.  

To make replace() safe for UTF-8 strings when *not* entering on a character 
boundary, all replace() has to do is detect whether it has entered on a byte 
in the range 0x80-0xBF (by [B]) and skip forward until it hits a byte that 
isn't in that range, or skip backward until it hits a byte in the range 
0xC0-0xFD -- the start of the multi-byte character.

Another way of looking at it is this:

   ASCII characters always match the binary pattern 0xxx xxxx
   UTF-8 multi-byte lead bytes match the pattern 11xx xxxx
   UTF-8 following bytes match 10xx xxxx

So you can always tell at a glance where in a UTF-8 string you land -- in an 
ASCII character, at the start of a multi-byte sequence, or in the middle of a 
multi-byte sequence.
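That three-way test can be written directly against the top two bits.  A 
minimal Haskell sketch (again, my names, not darcs code):

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- Where in a UTF-8 byte stream a given byte can sit, judged purely
-- by its top two bits.  Illustration only.
data BytePos = Ascii | Lead | Follower deriving (Eq, Show)

classify :: Word8 -> BytePos
classify b
  | b .&. 0x80 == 0x00 = Ascii     -- 0xxx xxxx: 7-bit ASCII
  | b .&. 0xC0 == 0xC0 = Lead      -- 11xx xxxx: starts a multi-byte sequence
  | otherwise          = Follower  -- 10xx xxxx: continuation byte
```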

In fact, here is a Haskell UTF-8 character boundary locating function.   This 
function takes a possibly truncated UTF-8 string and returns the string 
starting at the first character boundary.  It should be relatively efficient, 
although I'm not very good with Haskell, and there may be other optimizations 
that can be performed.  

   import Data.Char (chr)

   utf8Boundary :: String -> String
   utf8Boundary [] = []
   utf8Boundary p@(x:xs)
      | x < '\128' || x > '\191'  = p                -- ASCII or lead byte
      | otherwise                 = utf8Boundary xs  -- continuation: skip it

   -- TESTS:
   -- Should return the whole string
   test1 = utf8Boundary ("Sean " ++ [chr 195, chr 182] ++ " Russell")
   -- The first character is the second half of a two-byte character.  This
   -- test should return everything after that character.
   test2 = utf8Boundary (chr 182 : " Russell")

On average, given ASCII text it'll perform a single comparison and return.  If 
it does fail to find a boundary on the first or second test, it is bounded to 
iterate at most 5 times (since a UTF-8 character can have at most 6 bytes, 
and UCS space -- the most common Unicode pages -- encodes to 3 bytes maximum, 
for a bound of two iterations).

- -- 
### SER   
### Deutsch|Esperanto|Francaise|Linux|XML|Java|Ruby|Aikido|Dirigibles
### http://www.germane-software.com/~ser  jabber.com:ser  ICQ:83578737 
### GPG: http://www.germane-software.com/~ser/Security/ser_public.gpg




