[darcs-users] UTF-16 (was: Default binary masks)
Sean E. Russell
ser at germane-software.com
Sun Nov 30 22:04:25 UTC 2003
On Sunday 30 November 2003 15:28, David Roundy wrote:
> I think I see the confusion. The key is that darcs replace doesn't just
> replace a given sequence of bytes wherever it occurs. It only replaces
> that sequence of bytes if it occurs as a single "token". Where "token" is
> defined as a contiguous set of bytes within a specified set, bounded by
> bytes not lying in the set of allowed bytes.
Ouch. Yeah, that's a bugger. Tokens, then, aren't Lists. They're Sets.
Unordered sets, to be specific.
So, if the token set were "foo", replace does the equivalent of matching on the
I.e., a byte that isn't in the token set (or, presumably, no byte), followed by
some token bytes, followed by a byte that isn't in the token set. Or, put
another way, your token set defines what regex calls "\w".
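
To make the token rule concrete, here is a minimal Python sketch of a replace
that only touches whole tokens. It is not darcs's implementation; the function
name replace_token and its parameters are made up for illustration, and it
assumes the token set is given as a plain byte string of allowed bytes.

```python
from itertools import groupby


def replace_token(data: bytes, token_chars: bytes, old: bytes, new: bytes) -> bytes:
    """Replace `old` with `new` only where `old` is a complete token.

    A token is a maximal run of bytes that are all in `token_chars`,
    bounded by bytes outside that set (or by the start/end of the data).
    Hypothetical helper, not part of darcs.
    """
    allowed = set(token_chars)  # set of byte values; order is irrelevant
    out = []
    # groupby splits the data into alternating runs of token / non-token bytes.
    for is_token, run in groupby(data, key=lambda b: b in allowed):
        chunk = bytes(run)
        out.append(new if is_token and chunk == old else chunk)
    return b"".join(out)
```

With the lowercase letters as the token set, replacing "foo" with "bar" in
"foo food (foo)" leaves "food" alone, because there "foo" is only part of a
longer token.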
I misunderstood you. I thought tokens were Lists, where element order is
significant and element cardinality isn't.
Yeah, this won't work without some reworking of replace().
> darcs replace <omega> a
> (assuming for a moment that we had a way to specify <omega> on the command
Incidentally, if the locale is set correctly, most modern terminals can
understand UTF-8 characters (or various other encodings, of course). For
instance, I've got the Esperanto characters -- which are multi-byte UTF-8
characters -- mapped via xmodmap, so I can just type them into any KDE
> Requiring tokens to begin or end on UTF-8 boundaries would fix this
> particular problem, but leave other problems, and require a special patch
> type, since non-UTF8-users would want their latin1 files to be replaced
> normally as well. A better solution would be to tokenize in unicode
This gets back to the beginning of the thread: multiple encoding support is a
Pain In The Ass. The only exception is when the encodings overlap; for
example, supporting 7-bit ASCII and UTF-8 is really easy.
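
A small Python sketch of why that overlap is easy, using only the standard
codecs: ASCII text has the same bytes in both encodings, and every byte of a
multi-byte UTF-8 character is 0x80 or above, so it can never be mistaken for
an ASCII byte. The helper names are hypothetical.

```python
def same_in_ascii_and_utf8(text: str) -> bool:
    """True when `text` has identical bytes in 7-bit ASCII and UTF-8."""
    try:
        return text.encode("ascii") == text.encode("utf-8")
    except UnicodeEncodeError:
        # The text contains a character that 7-bit ASCII cannot represent.
        return False


def has_no_ascii_bytes(ch: str) -> bool:
    """True when no byte of the UTF-8 encoding falls in the ASCII range."""
    return all(b >= 0x80 for b in ch.encode("utf-8"))
```

For the Esperanto letter U+0109 (c with circumflex), the UTF-8 encoding is the
two bytes C4 89, both outside the ASCII range.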
> in how you define your tokens (either tokens don't contain multibyte
> characters, or all multibyte characters are legal in tokens), but leaves us
> with just one replace patch type, which is reasonably simple.
There is a third option: add an encoding header to the patch that specifies
which encoding is used, then use some i18n library (iconv) to convert from
that encoding to UTF-8. Internally to darcs, you'll deal only with UTF-8
strings, and then converting to and from the i18n encoding on reading or
iconv, if you're interested, is at http://www.gnu.org/software/libiconv/.
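
A rough sketch of that third option, in Python rather than against iconv
itself: Python's built-in codecs stand in for the conversion library here, and
the idea of a per-patch "declared encoding" is the assumption from the
paragraph above, not an existing darcs feature. Function names are
hypothetical.

```python
def to_internal(raw: bytes, declared_encoding: str) -> bytes:
    """Convert bytes in the patch's declared encoding to UTF-8 on reading."""
    return raw.decode(declared_encoding).encode("utf-8")


def from_internal(utf8: bytes, declared_encoding: str) -> bytes:
    """Convert internal UTF-8 bytes back to the declared encoding on writing."""
    return utf8.decode("utf-8").encode(declared_encoding)
```

Everything between the two calls would see only UTF-8, so a latin1 file and a
UTF-8 file could share one code path for tokenizing and replacing.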
### http://www.germane-software.com/~ser jabber.com:ser ICQ:83578737
### GPG: http://www.germane-software.com/~ser/Security/ser_public.gpg