[darcs-users] blue color bug

Alex Shinn foof at synthcode.com
Mon Jul 5 02:05:08 UTC 2004


At Sat, 3 Jul 2004 09:22:26 -0400, David Roundy wrote:
> 
> Either 5) or perhaps some degree of auto-detection.  I understand most
> non-utf8 encodings aren't valid utf8, so one can at least autodetect
> whether it is *not* utf8.  So one could check if it appears to be utf8, and
> if so try to display it correctly... and otherwise assume it's in the right
> format for the current locale and hope for the best?

Auto-detection works best if you know the language - for instance if
you know the text is Japanese then it's easy to distinguish between
utf-8, sjis, euc-jp and iso-2022-jp.  In a pinch I've done this very
reliably with a quick and dirty regexp.  In this case sjis and euc-jp
are almost always invalid as utf-8 once you stray outside of ASCII
chars, and iso-2022-jp is 7-bit and recognizable by its escape
sequences.
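
As a minimal sketch of this kind of detection, assuming the text is
known to be Japanese: the Python below swaps in strict trial decoding
with the standard codecs in place of the regexp mentioned above.  The
function name and the order of the candidates are assumptions made
for illustration, not anything darcs does.

```python
from typing import Optional

# Encodings tried in order with strict decoding (ordering is an assumption).
CANDIDATES = ("utf-8", "euc_jp", "shift_jis")


def guess_japanese_encoding(data: bytes) -> Optional[str]:
    """Guess the encoding of bytes assumed to hold Japanese text."""
    # ISO-2022-JP is 7-bit and switches character sets with ESC sequences,
    # so it has to be checked before UTF-8 (its bytes are all ASCII).
    if b"\x1b$" in data or b"\x1b(" in data:
        return "iso2022_jp"
    for encoding in CANDIDATES:
        try:
            data.decode(encoding)  # strict mode raises on invalid bytes
        except UnicodeDecodeError:
            continue
        return encoding
    return None  # none of the candidates decoded cleanly


if __name__ == "__main__":
    sample = "日本語のテキスト"
    for enc in ("utf-8", "euc_jp", "shift_jis", "iso2022_jp"):
        print(enc, "->", guess_japanese_encoding(sample.encode(enc)))
```

The order matters: some sjis text (half-width katakana, for example)
also decodes as euc-jp, so a guess like this can still be wrong on
short or unusual input.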

When you don't know the language it becomes much more complicated,
since euc-jp could just as easily be euc-kr (to say nothing of the
fact that text in one ISO-8859-* encoding is generally valid in the
others).  It generally takes linguistic processing to get this right
(Mozilla does universal detection, but not very reliably).  You can
make this faster by looking for common sequences in a language, like
"the" in English, but especially with computer text and source code
you're going to get a lot of pathological cases, and auto-detection
will never be perfect.  So if darcs goes this route you definitely
want to be able to set the locale on a per-file basis.
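
A rough sketch of how a per-file setting could sit in front of the
auto-detection discussed in this thread.  Everything here is
hypothetical: the function name, the `overrides` mapping from path to
encoding, and the `locale_encoding` parameter are illustrative only
and do not describe any existing darcs option.

```python
from typing import Mapping


def decode_for_display(
    path: str,
    data: bytes,
    overrides: Mapping[str, str],
    locale_encoding: str,
) -> str:
    """Decode file contents for display, preferring an explicit setting."""
    explicit = overrides.get(path)
    if explicit is not None:
        # A per-file setting always wins over any guessing.
        return data.decode(explicit, errors="replace")
    try:
        # Most non-UTF-8 text fails strict UTF-8 decoding.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Otherwise assume the locale's encoding and hope for the best.
        return data.decode(locale_encoding, errors="replace")


if __name__ == "__main__":
    euc = "日本語".encode("euc_jp")
    print(decode_for_display("notes.txt", euc, {"notes.txt": "euc_jp"}, "iso-8859-1"))
    print(decode_for_display("other.txt", euc, {}, "iso-8859-1"))
```

The second call shows the failure mode: without an override the euc-jp
bytes are not valid utf-8, so they fall through to the locale encoding
and come out as mojibake.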

-- 
Alex



