[darcs-users] high-level UTF-8 feedback

Sun Jan 3 13:40:51 UTC 2010

Hi Petr and darcs-users,

2009/12/25 Petr Rockai <me at mornfall.net>:
> Eric Kow <kowey at darcs.net> writes:
>> 5. Despite our confidence about UTF-8 detection, tagging is still a good idea
>>    [a] for sheer conservatism and [b] for potential efficiency gains [c]
>>    because it lets us reliably distinguish NFC UTF-8 vs the other one.
> it may make sense to think of proper metadata formatting before the
> Ignore-This madness spins completely out of control (it was already bad
> with the random junk, and it's only getting worse).
>
> Btw. as for [b] I don't think that scanning for the utf8 tag is
> substantially faster than checking whether a string is utf-8 or not.

True

> For [c], excuse my ignorance (I am still in the process of catching up
> on darcs-users@) but where exactly do we care whether we have NFC or
> not? I.e. is this a matter of correctness for darcs, or is it just a
> matter of unexpectedly non-matched --match/--patch?

Just a matter of matching.
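
To make the matching problem concrete, here is a minimal sketch in Python
rather than darcs's Haskell, using the standard library's `unicodedata` module
as a stand-in for ICU. The function names and the sample patch name are
made up for illustration; this is not how darcs implements --match.

```python
import unicodedata

# The same visible text in two normalisation forms.
# NFC: "é" is the single code point U+00E9.
# NFD: "é" is "e" followed by the combining acute accent U+0301.
name_nfc = unicodedata.normalize("NFC", "Caf\u00e9 fix")
name_nfd = unicodedata.normalize("NFD", "Caf\u00e9 fix")


def naive_match(pattern: str, patch_name: str) -> bool:
    """Substring match on the raw code points, with no normalisation."""
    return pattern in patch_name


def normalised_match(pattern: str, patch_name: str) -> bool:
    """Substring match after bringing both sides to NFC."""
    to_nfc = lambda s: unicodedata.normalize("NFC", s)
    return to_nfc(pattern) in to_nfc(patch_name)
```

With these definitions, an NFC pattern fails to match the NFD patch name under
`naive_match`, even though the two look identical on screen, while
`normalised_match` finds it. Nothing is corrupted either way; the patch is
simply not found.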

> In the latter case, can we make this optional, and maybe issue a warning
> when a non-ASCII matcher is used and we don't have ICU, instead of
> having a hard dependency?

We could do that, but I would prefer to have darcs "just work" and do the
right thing instead of issuing a warning. Let me reiterate that a lot of
packages use ICU these days; a quick Google search turns up CouchDB
and OpenOffice.org.

> Additionally, do we believe that not having to run the metadata being
> matched through normalisation is a substantial performance boost? (In
> pseudocode) Is "(normalise metadata) `match` (normalise matcher)"
> substantially slower than "metadata `match` (normalise matcher)"? I
> understand that the tagging is needed for the latter to work reliably.

I'd think it is when you're doing a linear search through a potentially large
patch history.
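
As a rough illustration of the two strategies from the pseudocode, here is a
Python sketch. The `Patch` type and its `nfc_tagged` field are hypothetical
stand-ins for "this patch's metadata carries a tag saying it is already NFC";
they do not reflect the actual darcs patch format.

```python
import unicodedata
from typing import List, NamedTuple


class Patch(NamedTuple):
    name: str
    nfc_tagged: bool  # hypothetical: True if the metadata is tagged as NFC


def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)


def match_always_normalise(pattern: str, history: List[Patch]) -> List[Patch]:
    """(normalise metadata) `match` (normalise matcher), for every patch."""
    wanted = nfc(pattern)
    return [p for p in history if wanted in nfc(p.name)]


def match_trusting_tag(pattern: str, history: List[Patch]) -> List[Patch]:
    """metadata `match` (normalise matcher) for tagged patches;
    untagged patches still have to be normalised before comparing."""
    wanted = nfc(pattern)
    return [
        p for p in history
        if wanted in (p.name if p.nfc_tagged else nfc(p.name))
    ]
```

Both functions return the same patches as long as the tag is truthful. The
second one skips one normalisation per tagged patch in the linear scan, and
falls back to normalising for untagged (for example, older) patches.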

> The catch is that for the vast body of the existing patches, this
> optimisation is not going to work at all anyway (since even if it is
> accidentally in the right format, we don't know -- there is no tag --
> and have to re-normalise). So I would say that this comes down to
> asking, whether we expect to stick to this particular patch (metadata)
> format (free text with Ignore-This tags, maybe utf8 and maybe arbitrary
> other encoding) for long enough to make this optimisation pay off. (I am
> leaning more toward saying no, and just normalising everything for good
> measure, if we can, and issuing warnings when we cannot).

I for one hope to have designed something that can last quite a while, so
that this optimization will pay off someday. Perhaps for all those folks that
are going to convert their huge-ass git repos to darcs one day ;-).

Reinier