[darcs-users] high-level UTF-8 feedback

Petr Rockai me at mornfall.net
Fri Dec 25 19:42:39 UTC 2009


Eric Kow <kowey at darcs.net> writes:
> 5. Despite our confidence about UTF-8 detection, tagging is still a good idea
>    [a] for sheer conservatism and [b] for potential efficiency gains [c] 
>    because it lets us reliably distinguish NFC UTF-8 vs the other one.
It may make sense to think of proper metadata formatting before the
Ignore-This madness spins completely out of control (it was already bad
with the random junk, and it's only getting worse).

Btw, as for [b], I don't think that scanning for the utf8 tag is
substantially faster than checking whether a string is utf-8 or not.
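
To make the comparison in [b] concrete, here is a minimal sketch in
Python (not darcs code). The tag string and both function names are made
up for illustration; the real tag format is not specified here. The
point is only that both checks are a single linear pass over the
metadata bytes.

```python
# Hypothetical tag text, chosen only for this illustration.
UTF8_TAG = b"Ignore-This: utf8"


def has_utf8_tag(metadata: bytes) -> bool:
    """Linear scan of the metadata for the (hypothetical) tag."""
    return UTF8_TAG in metadata


def is_valid_utf8(metadata: bytes) -> bool:
    """Linear scan that checks whether the bytes decode as UTF-8."""
    try:
        metadata.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Whether one is measurably faster than the other would have to be
benchmarked; the sketch says nothing about constant factors.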

For [c], excuse my ignorance (I am still in the process of catching up
on darcs-users@) but where exactly do we care whether we have NFC or
not? I.e. is this a matter of correctness for darcs, or is it just a
matter of unexpectedly non-matched --match/--patch?

In the latter case, can we make this optional, and maybe issue a warning
when a non-ASCII matcher is used and we don't have ICU, instead of
having a hard dependency?
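
A rough sketch of what "optional, with a warning" could look like, in
Python rather than Haskell. Everything here is an assumption made for
illustration: the function name `prepare_matcher`, the `have_normaliser`
flag (standing in for "was darcs built with ICU?"), and the use of
Python's `unicodedata` in place of ICU.

```python
import unicodedata
import warnings


def prepare_matcher(pattern: str, have_normaliser: bool = True) -> str:
    """Normalise a matcher to NFC if possible, else warn and pass it through."""
    if pattern.isascii():
        # Pure ASCII text is unchanged by NFC, so no normaliser is needed.
        return pattern
    if have_normaliser:
        return unicodedata.normalize("NFC", pattern)
    warnings.warn(
        "non-ASCII matcher used without Unicode normalisation; "
        "some patches may unexpectedly fail to match"
    )
    return pattern
```

With this shape, ASCII-only users never notice the missing dependency,
and everyone else gets a hint about why a match may have come up empty.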

Additionally, do we believe that not having to run the metadata being
matched through normalisation is a substantial performance boost? (In
pseudocode) Is "(normalise metadata) `match` (normalise matcher)"
substantially slower than "metadata `match` (normalise matcher)"? I
understand that the tagging is needed for the latter to work reliably.
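
The two variants can be sketched as follows, in Python for the sake of a
runnable example. The function names are hypothetical, and plain
substring search stands in for whatever the real matcher does.

```python
import unicodedata


def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


def match_normalising_both(metadata: str, matcher: str) -> bool:
    """(normalise metadata) `match` (normalise matcher)"""
    return nfc(matcher) in nfc(metadata)


def match_trusting_metadata(metadata: str, matcher: str) -> bool:
    """metadata `match` (normalise matcher); only safe if metadata is known NFC."""
    return nfc(matcher) in metadata


# Metadata stored in decomposed form: "e" followed by a combining acute accent.
decomposed = "add cafe\u0301 menu"
# Matcher typed in precomposed form: a single U+00E9 character.
matcher = "caf\u00e9"

print(match_normalising_both(decomposed, matcher))   # True
print(match_trusting_metadata(decomposed, matcher))  # False
```

The second variant saves one normalisation pass per patch, but silently
misses the match when the metadata is not already NFC, which is why it
needs the tag to be trustworthy.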

The catch is that for the vast body of the existing patches, this
optimisation is not going to work at all anyway (since even if it is
accidentally in the right format, we don't know -- there is no tag --
and have to re-normalise). So I would say that this comes down to
asking whether we expect to stick to this particular patch (metadata)
format (free text with Ignore-This tags, maybe utf8 and maybe arbitrary
other encoding) for long enough to make this optimisation pay off. (I
lean toward saying no, and just normalising everything for good
measure, if we can, and issuing warnings when we cannot).

If a new patch format eventually comes around, it can add some structure
to metadata -- and since things like file uuids will need a new patch
format and conversion of repositories, we can tack some utf-8
requirements onto that format, and hopefully the general ICU (and/or
Haskell-based normalisation solution) situation will be easier by that

