[darcs-users] UTF-16 (was: Default binary masks)
droundy at abridgegame.org
Sun Nov 30 20:28:18 UTC 2003
On Sun, Nov 30, 2003 at 03:06:00PM -0500, Sean E. Russell wrote:
> Ok, caveat time:
> I *know* that I don't understand the entire darcs side of the problem
> yet, so there's great potential for talking at cross-purposes here.
> However, this has never stopped me before, so I'm forging blindly ahead.
I think I see the confusion. The key is that darcs replace doesn't just
replace a given sequence of bytes wherever it occurs. It only replaces
that sequence of bytes if it occurs as a single "token", where a "token" is
defined as a contiguous run of bytes drawn from a specified set, bounded by
bytes that are not in that set.
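
To make that rule concrete, here is a rough Python sketch of token-wise replacement over raw bytes. It is only an illustration of the definition above, not darcs's actual implementation; the names replace_token and token_bytes are invented for this example.

```python
def replace_token(data: bytes, token_bytes: bytes, old: bytes, new: bytes) -> bytes:
    """Replace `old` with `new` only where `old` is a whole token.

    A token is a maximal run of bytes that all belong to `token_bytes`
    (hypothetical helper, for illustration only).
    """
    allowed = set(token_bytes)
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] in allowed:
            # Extend j to the end of this run of allowed bytes.
            j = i
            while j < len(data) and data[j] in allowed:
                j += 1
            token = data[i:j]
            out += new if token == old else token
            i = j
        else:
            # Separator byte: copied through unchanged.
            out.append(data[i])
            i += 1
    return bytes(out)


ident = b"abcdefghijklmnopqrstuvwxyz_"
result = replace_token(b"foo = foo_bar + food", ident, b"foo", b"baz")
```

Only the standalone "foo" is rewritten; "foo_bar" and "food" are different tokens and are left alone.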
The problem I see occurs if you want to define your tokens to contain only
certain multibyte characters--for obvious reasons, this isn't likely to
work. For example, if you want your tokens to consist of only 'a's and
omegas, the closest we could come would be to allow the bytes 0x61, 0xCE and
0xA9 in tokens, but if an omega is followed by a different multibyte
character starting with 0xCE, that omega would be seen as part of the token
[ 0xCE 0xA9 0xCE ], and wouldn't be replaced if you did a
darcs replace <omega> a
(assuming for a moment that we had a way to specify <omega> on the command
line).
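
The failure can be seen by splitting a byte string into tokens by hand. The sketch below is in Python and assumes the omega is U+03A9 (UTF-8 bytes 0xCE 0xA9); the capital sigma (U+03A3, bytes 0xCE 0xA3) is just one possible choice for "a different multibyte character starting with 0xCE", and the tokens helper is made up for this example.

```python
def tokens(data: bytes, token_bytes: bytes) -> list:
    """Return the maximal runs of bytes drawn from `token_bytes` (illustrative helper)."""
    allowed = set(token_bytes)
    found, current = [], bytearray()
    for b in data:
        if b in allowed:
            current.append(b)
        elif current:
            found.append(bytes(current))
            current.clear()
    if current:
        found.append(bytes(current))
    return found


omega = "\u03a9".encode("utf-8")   # b"\xce\xa9"
sigma = "\u03a3".encode("utf-8")   # b"\xce\xa3" (assumed neighbour character)
token_bytes = b"a" + omega         # the bytes 0x61, 0xCE, 0xA9

alone = tokens(b"x " + omega + b" y", token_bytes)             # omega by itself
followed = tokens(b"x " + omega + sigma + b" y", token_bytes)  # omega then sigma
```

Here alone holds the two-byte omega token, while followed holds the three-byte token 0xCE 0xA9 0xCE: the sigma's lead byte has been absorbed into the token, so it no longer equals the omega and a replace would skip it.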
Requiring tokens to begin or end on UTF-8 boundaries would fix this
particular problem, but leave other problems, and require a special patch
type, since non-UTF8-users would want their latin1 files to be replaced
normally as well. A better solution would be to tokenize in unicode
characters, but that looks like quite a pain. The interim solution is to
not specify multibyte characters in the token definition. This limits you
in how you define your tokens (either tokens don't contain multibyte
characters, or all multibyte characters are legal in tokens), but leaves us
with just one replace patch type, which is reasonably simple.
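
The two permitted kinds of token definition can be sketched as follows, again in Python and purely as an illustration. The regex-based replace_token helper and the two byte sets are assumptions of this example, not darcs code; "all multibyte characters are legal" is modelled here as allowing every byte from 0x80 to 0xFF.

```python
import re


def replace_token(data: bytes, token_bytes: bytes, old: bytes, new: bytes) -> bytes:
    """Replace `old` by `new` wherever `old` is a whole token (illustrative helper)."""
    # Build a byte character class such as [\x61\xce\xa9] from the allowed bytes.
    cls = b"".join(b"\\x%02x" % b for b in sorted(set(token_bytes)))
    pattern = re.compile(b"[" + cls + b"]+")
    return pattern.sub(lambda m: new if m.group(0) == old else m.group(0), data)


omega = "\u03a9".encode("utf-8")
ascii_only = b"abcdefghijklmnopqrstuvwxyz"           # no multibyte bytes in tokens
with_high = ascii_only + bytes(range(0x80, 0x100))   # every multibyte byte allowed

# Option 1: multibyte characters are never part of a token, so they act as separators.
option_a = replace_token("foo \u03a9 foo".encode("utf-8"), ascii_only, b"foo", b"bar")

# Option 2: all multibyte characters are token characters, so a lone omega is a
# token of its own and "omega sigma" is a different, longer token.
option_b = replace_token("x \u03a9 \u03a9\u03a3".encode("utf-8"), with_high, omega, b"a")
```

In both cases a token can never end in the middle of a multibyte character, which is what goes wrong when only some of the high bytes are allowed.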