[darcs-users] UTF-16 (was: Default binary masks)
Sean E. Russell
ser at germane-software.com
Sun Nov 30 20:06:00 UTC 2003
Ok, caveat time:
I *know* that I don't understand the entire darcs side of the problem yet, so
there's great potential for talking at cross-purposes here. However, this
has never stopped me before, so I'm forging blindly ahead.
On Sunday 30 November 2003 12:36, David Roundy wrote:
> > So, I'm understanding you to be saying that replace doesn't replace a
> > List (as in, order is significant, and duplicates are allowed) of bytes
> > with another List of bytes? Is this a 'tr' replace?
> It does do a list replacement with another list... Hmmmm. You're right
> that there shouldn't be a real problem in terms of data corruption.
> The only problem now that I think about it would be that if you allowed
> tokens containing multibyte characters, you'd end up missing some
> replacements possibly. Using alphanumerics to represent 7-bit ASCII and
> numerals for other bytes, we'd
> have the problem that if you defined your tokens as containing:
> ['a' 'b' 'c' 'd' '21' '22']
> then "fg731abcf" would be thought to contain the token "1abc", rather than
> the correct token "abc", since tokens are defined in terms of which bytes
> are allowed. This would mean that a darcs replace "abc" "bcd" would miss
> this occurrence of "abc".
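To make the quoted failure mode concrete, here is a small sketch of byte-set tokenization (a hypothetical illustration, not darcs source): a token is any maximal run of characters drawn from the allowed set, so a stand-in "numeral" byte adjacent to "abc" gets absorbed into the token.

```haskell
import Data.List (groupBy)

-- Hypothetical sketch of byte-set tokenization (not darcs's code):
-- a token is any maximal run of characters drawn from the allowed set.
tokens :: String -> String -> [String]
tokens allowed = filter (all inSet) . groupBy (\a b -> inSet a == inSet b)
  where inSet c = c `elem` allowed

-- With 'a' 'b' 'c' 'd' plus the stand-in bytes '1' and '2' allowed,
-- tokens "abcd12" "fg731abcf" yields ["1abc"], not ["abc"],
-- so a replace of token "abc" would miss this occurrence.
```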
Allow me to restate the example, with different values:
A UTF-8 string would look something like this, as a List of characters
(omitting the apostrophes):
[ a b c <omega> d e f ]
This is a string of 7 characters, which in UTF-8 is a list of 8 bytes:
[ 0x61 0x62 0x63 0xCE 0xA9 0x64 0x65 0x66 ]
0xCE marks the start of a 2-byte multi-byte character (the Greek
Omega character). The following byte will not be a 7-bit ASCII character.
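As a sketch of why those two bytes are what they are, here is the two-byte UTF-8 encoding rule (for code points U+0080 through U+07FF) written out; `encode2` is a hypothetical helper, not anything in darcs:

```haskell
import Data.Bits ((.&.), (.|.), shiftR)
import Data.Word (Word8)

-- Two-byte UTF-8 encoding: the lead byte 110xxxxx carries the high
-- 5 bits of the code point, the continuation byte 10xxxxxx the low 6.
encode2 :: Int -> [Word8]
encode2 cp = [ 0xC0 .|. fromIntegral (cp `shiftR` 6)
             , 0x80 .|. fromIntegral (cp .&. 0x3F) ]

-- The Greek capital Omega is code point U+03A9:
-- encode2 0x03A9 == [0xCE, 0xA9]
```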
I think this second List is how darcs sees the string; is this correct? Just
a sequence of bytes, in which there are two illegal characters: EOF and 0x00.
EOF and 0x00 are both 7-bit ASCII characters, and therefore won't occur in a
UTF-8 multi-byte character (all bytes in a UTF-8 multi-byte sequence have
their high bit set) [A]. Furthermore, the byte that starts a multi-byte
sequence is in the range 0xC0-0xFD, and these values are also not
allowed as the second byte of a multi-byte sequence [B]. That is, the first
byte is 0xC0-0xFD, and following bytes are 0x80-0xBF.
So, back to the example.
[ f g <o-umlaut> a b c f ]
would look to darcs like the byte list:
[ 0x66 0x67 0xC3 0xB6 0x61 0x62 0x63 0x66 ]
replace "abc" "bcd"
replace [ 0x61 0x62 0x63 ] [ 0x62 0x63 0x64 ]
The sequence of bytes "abc" is unique in the string.
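A list-of-bytes replace over that example can be sketched as follows (a hypothetical `replaceBytes`, written for illustration; it is not darcs's actual implementation):

```haskell
import Data.List (isPrefixOf)
import Data.Word (Word8)

-- Hypothetical sketch of a list-of-bytes replace (not darcs's code):
-- substitute every occurrence of pat with rep, scanning left to right.
-- Assumes pat is non-empty.
replaceBytes :: [Word8] -> [Word8] -> [Word8] -> [Word8]
replaceBytes _   _   []       = []
replaceBytes pat rep s@(x:xs)
  | pat `isPrefixOf` s = rep ++ replaceBytes pat rep (drop (length pat) s)
  | otherwise          = x : replaceBytes pat rep xs

-- replaceBytes [0x61,0x62,0x63] [0x62,0x63,0x64]
--              [0x66,0x67,0xC3,0xB6,0x61,0x62,0x63,0x66]
--   == [0x66,0x67,0xC3,0xB6,0x62,0x63,0x64,0x66]
-- i.e. "abc" is replaced and the o-umlaut bytes are untouched.
```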
If all three of the strings (the string containing text to be replaced, the
replacement string, and the pattern) are valid UTF-8 strings, then I
don't see the problem. The problem arises if the replacement string (or any
of the other strings involved) isn't valid UTF-8; for example, if the string
has been truncated at the beginning or the end in the middle of a multi-byte
sequence.
If darcs replace() will accept 8-bit bytes, and if replace() is guaranteed to
enter the string on a character boundary, then replace() should be safe for
UTF-8 strings.
To make replace() safe for UTF-8 strings when *not* entering on a character
boundary, all replace() has to do is detect whether it has entered on a byte
in the range 0x80-0xBF (by [B]) and skip forward until it hits a byte that
isn't in that range, or skip backward until it hits a byte that is in the
range 0xC0-0xFD, denoting the start of the multi-byte character.
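The backward-skip direction can be sketched as a small helper (hypothetical, written for illustration only): walk left from a byte index until we leave the continuation-byte range 0x80-0xBF.

```haskell
import Data.Word (Word8)

-- Hypothetical sketch of the backward skip: given a byte list and an
-- index into it, step left past continuation bytes (0x80-0xBF) until
-- we land on a character boundary (an ASCII byte or a lead byte).
-- (List indexing with !! is O(n); a real version would use an array.)
backToBoundary :: [Word8] -> Int -> Int
backToBoundary bs i
  | i <= 0                   = 0
  | isContinuation (bs !! i) = backToBoundary bs (i - 1)
  | otherwise                = i
  where isContinuation b = b >= 0x80 && b <= 0xBF

-- In [0x66, 0x67, 0xC3, 0xB6, 0x61], index 3 (0xB6) is a continuation
-- byte; backToBoundary steps left to index 2, the 0xC3 lead byte.
```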
Another way of looking at it is this:
ASCII characters always match the binary pattern 0xxx xxxx
UTF-8 multi-byte lead bytes match the pattern 11xx xxxx
UTF-8 continuation bytes match the pattern 10xx xxxx
So you can always tell at a glance where in a UTF-8 string you land: in an
ASCII character, at the start of a multi-byte sequence, or in the middle of a
multi-byte sequence.
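Those three bit patterns can be tested directly with masks; here is a small classifier sketch (hypothetical, not part of darcs):

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- The three cases above as a byte classifier, testing only the top bits.
data Utf8Pos = Ascii | LeadByte | Continuation deriving (Eq, Show)

classify :: Word8 -> Utf8Pos
classify b
  | b .&. 0x80 == 0x00 = Ascii         -- 0xxx xxxx
  | b .&. 0xC0 == 0xC0 = LeadByte      -- 11xx xxxx
  | otherwise          = Continuation  -- 10xx xxxx
```

For the Omega example above, 0x61 ('a') classifies as Ascii, 0xCE as LeadByte, and 0xA9 as Continuation.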
In fact, here is a Haskell UTF-8 character boundary locating function. This
function takes a possibly truncated UTF-8 string and returns the string
starting at the first character boundary. It should be relatively efficient,
although I'm not very good with Haskell, and there may be other optimizations
that can be performed.
utf8Boundry :: String -> String
utf8Boundry [] = []
utf8Boundry p@(x:xs)
  | x < '\128' || x > '\191' = p
  | otherwise = utf8Boundry xs
-- Should return the whole string (chr is from Data.Char)
test1 = utf8Boundry ("Sean " ++ [chr 195, chr 182] ++ " Russell")
-- The first character is the second half of a two-byte character. This
-- test should return everything after that character.
test2 = utf8Boundry ((chr 182):" Russell")
On average, given ASCII text, it'll perform a single comparison and return.
If it does fail to find a boundary on the first or second test, it is bounded
to iterate at most 5 times (since a UTF-8 character can have at most 6 bytes,
and UCS space, the most common Unicode pages, encodes to 3 bytes maximum,
for a bound of two iterations).
### http://www.germane-software.com/~ser jabber.com:ser ICQ:83578737
### GPG: http://www.germane-software.com/~ser/Security/ser_public.gpg