[darcs-users] UTF-16 (was: Default binary masks)

Sean E. Russell ser at germane-software.com
Sun Nov 30 22:04:25 UTC 2003


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Sunday 30 November 2003 15:28, David Roundy wrote:
> I think I see the confusion.  The key is that darcs replace doesn't just
> replace a given sequence of bytes wherever it occurs.  It only replaces
> that sequence of bytes if it occurs as a single "token".  Where "token" is
> defined as a contiguous set of bytes within a specified set, bounded by
> bytes not lying in the set of allowed bytes.

Ouch.  Yeah, that's a bugger.  Tokens, then, aren't Lists.  They're Sets.  
Unordered sets, to be specific.

So, if the token set were "foo", replace does the equivalent of matching on the 
regexp:

	(^|[^foo])[foo]+([^foo]|$)

I.e., a byte that isn't in the token set (or, presumably, no byte), followed by 
some token bytes, followed by a byte that isn't in the token set (or the end of 
the input).  Or, put another way, your token set defines what regex calls "\w".
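
To make sure I have the rule right, here is a minimal sketch of it in Python.
This is only an illustration of the behaviour David describes, not darcs code
(darcs is written in Haskell); the name replace_token and its parameters are
made up for the example.

```python
import re

def replace_token(data: bytes, token_bytes: bytes, old: bytes, new: bytes) -> bytes:
    """Replace `old` with `new` only where `old` is a complete token.

    A token is a maximal run of bytes that all belong to `token_bytes`
    (treated as an unordered set), bounded by bytes outside that set or by
    the start/end of the data. `token_bytes` must not be empty.
    """
    # Build a byte character class from the set, escaping each byte so that
    # characters such as "]" or "-" are taken literally.
    members = b"".join(re.escape(bytes([b])) for b in sorted(set(token_bytes)))
    token_run = re.compile(b"[" + members + b"]+")
    # Each match is one whole token; swap it only if it equals `old` exactly.
    return token_run.sub(lambda m: new if m.group(0) == old else m.group(0), data)

letters = b"abcdefghijklmnopqrstuvwxyz"
result = replace_token(b"foo food (foo)", letters, b"foo", b"bar")
```

Here "food" is left alone because the token is the whole run "food", not the
"foo" inside it.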

I misunderstood you.  I thought tokens were Lists, where element order is 
significant and element cardinality isn't.

Yeah, this won't work without some reworking of replace().

> darcs replace <omega> a
>
> (assuming for a moment that we had a way to specify <omega> on the command
> line.)

Incidentally, if the locale is set correctly, most modern terminals can 
understand UTF-8 characters (or various other encodings, of course).  For 
instance, I've got the Esperanto characters -- which are multi-byte UTF-8 
characters -- mapped via xmodmap, so I can just type them into any KDE 
application.

> Requiring tokens to begin or end on UTF-8 boundaries would fix this
> particular problem, but leave other problems, and require a special patch
> type, since non-UTF8-users would want their latin1 files to be replaced
> normally as well.  A better solution would be to tokenize in unicode

This gets back to the beginning of the thread: multiple encoding support is a 
Pain In The Ass.  The only exception is when the encodings overlap; for 
example, supporting 7-bit ASCII and UTF-8 is really easy.
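
A quick Python check of why that pair is easy, as an illustration only (the
variable names are mine): 7-bit ASCII text is byte-for-byte identical in
UTF-8, and every byte of a multi-byte UTF-8 character has the high bit set,
so it can't collide with an ASCII byte.  Latin-1 doesn't overlap that way.

```python
text = "darcs replace"
# Pure 7-bit ASCII text encodes to the same bytes in ASCII and in UTF-8.
same_bytes = text.encode("ascii") == text.encode("utf-8")

# GREEK SMALL LETTER OMEGA (U+03C9) takes two bytes in UTF-8, both >= 0x80.
omega = "\u03c9".encode("utf-8")
all_high = all(b >= 0x80 for b in omega)

def is_valid_utf8(raw: bytes) -> bool:
    """Return True if `raw` decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# An accented Latin-1 character is a single high byte, which is not valid
# UTF-8 on its own.
latin1_e_acute = "\u00e9".encode("latin-1")
latin1_ok_as_utf8 = is_valid_utf8(latin1_e_acute)
```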

> in how you define your tokens (either tokens don't contain multibyte
> characters, or all multibyte characters are legal in tokens), but leaves us
> with just one replace patch type, which is reasonably simple.

There is a third option: add an encoding header to the patch that specifies 
which encoding is used, then use some i18n library (iconv) to convert from 
that encoding to UTF-8.  Internally to darcs, you'll deal only with UTF-8 
strings, converting to and from the declared encoding when reading or 
writing streams.

iconv, if you're interested, is at http://www.gnu.org/software/libiconv/.
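
A rough sketch of the idea in Python, with Python's built-in codecs standing
in for iconv.  The patch header format isn't defined anywhere in this thread,
so declared_encoding is just a hypothetical value read from such a header, and
both function names are invented for the example.

```python
def read_as_utf8(raw: bytes, declared_encoding: str) -> bytes:
    """Convert bytes in the patch's declared encoding to UTF-8 for internal use."""
    return raw.decode(declared_encoding).encode("utf-8")

def write_from_utf8(internal: bytes, declared_encoding: str) -> bytes:
    """Convert internal UTF-8 bytes back to the declared encoding on output."""
    return internal.decode("utf-8").encode(declared_encoding)

# A Latin-1 file containing "caf" followed by e-acute (byte 0xE9).
latin1_input = b"caf\xe9"
internal = read_as_utf8(latin1_input, "latin-1")
round_trip = write_from_utf8(internal, "latin-1")

# The same approach covers UTF-16: omega (U+03C9) in little-endian UTF-16.
utf16_omega = read_as_utf8(b"\xc9\x03", "utf-16-le")
```

With that in place, replace() would only ever see UTF-8, whatever the file on
disk uses.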

- -- 
### SER   
### Deutsch|Esperanto|Francaise|Linux|XML|Java|Ruby|Aikido|Dirigibles
### http://www.germane-software.com/~ser  jabber.com:ser  ICQ:83578737 
### GPG: http://www.germane-software.com/~ser/Security/ser_public.gpg
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (GNU/Linux)

iD8DBQE/ymlpP0KxygnleI8RAq9sAJ9qN+Sr8ajPyIXeaBxVooLvfOPdhwCgnAhR
Suv2B9bsAFwWEtOE1Wx/jOE=
=0987
-----END PGP SIGNATURE-----




