[darcs-users] blue color bug

Juliusz Chroboczek jch at pps.jussieu.fr
Mon Jul 5 06:03:45 UTC 2004


> 1) current behaviour
> 2) blind output
> 3) pretend everything is in UTF8 and convert to current locale
> 4) pretend everything is in 8859-1 and convert to current locale
> 5) have a per-repository setting for encoding/charset and convert to
>    current locale.

6. If it's in [0x20..0x7E] ++ [0x80..0xFF], it's text; otherwise it's
binary.

The above test will correctly detect text in all Unix locales.  It
will generate some false positives; however, as true binary data
usually contains 0x00, I'd expect the false positives to be rather
rare.
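
To make option 6 concrete, here is a minimal sketch in Python (not darcs
code; the function names are made up for illustration).  It applies the
ranges exactly as written above, so newline and tab count as non-text; a
real implementation would presumably deal with those separately.

```python
def is_text_byte(b: int) -> bool:
    """Option 6 as written: printable ASCII or any byte with the high bit set."""
    return 0x20 <= b <= 0x7E or 0x80 <= b <= 0xFF


def looks_like_text(data: bytes) -> bool:
    """Treat data as text only if every byte passes the per-byte test."""
    return all(is_text_byte(b) for b in data)
```

Anything containing 0x00, another control character, or DEL (0x7F) comes
out as binary, which matches the expectation that false positives are rare.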

7. Try to determine if it's text in the current locale.  If it's not,
treat it as binary.

If the user is working with a repository in UTF-8 while living in an
ISO-2022-JP locale, he's got other problems.

So how do you detect if text is in the current locale?

* if the current locale is ASCII, it's text if it's in [0x20..0x7E];
* if the current locale is UTF-8, Shift-JIS or Big5, it's text if it's
  in [0x20..0x7E] ++ [0x80..0xFF];
* if the current locale is anything else, it's text if it's in
  [0x20..0x7E] ++ [0xA0..0xFF].

(You could be smarter than that in the case of UTF-8, but I don't
think it's worth the trouble.)
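
The three rules above could be sketched as follows.  This is only an
illustration in Python: the charset spellings in the two sets are
assumptions (real systems report several variants for the same charset),
and the function names are hypothetical.

```python
# Charset spellings below are assumptions; extend them for your platform.
ASCII_NAMES = {"ASCII", "US-ASCII", "ANSI_X3.4-1968"}
FULL_HIGH_NAMES = {"UTF-8", "UTF8", "SHIFT_JIS", "SJIS", "BIG5"}


def is_text_byte_in(charset: str, b: int) -> bool:
    """Apply the per-locale byte ranges from the list above."""
    if 0x20 <= b <= 0x7E:
        return True                  # printable ASCII is text everywhere
    name = charset.upper()
    if name in ASCII_NAMES:
        return False                 # nothing above 0x7E is text in ASCII
    if name in FULL_HIGH_NAMES:
        return 0x80 <= b <= 0xFF     # multibyte charsets use all high bytes
    return 0xA0 <= b <= 0xFF         # anything else: 0x80..0x9F is not text


def is_text_in(charset: str, data: bytes) -> bool:
    """Data is text in the given charset if every byte passes."""
    return all(is_text_byte_in(charset, b) for b in data)
```

For example, is_text_in("ISO-8859-1", b"\x85") is False, while the same
byte is accepted under "Big5".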

If you've got a recent libc (XPG4, XPG5 or POSIX 2001), you can use
nl_langinfo(3) to detect the current charset.  If you haven't, you
should use Bruno Haible's (or is it Markus Kuhn's?) portable
nl_langinfo replacement (Markus Kuhn's Unicode FAQ will give you a
pointer).
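
For illustration, this is roughly what the nl_langinfo query looks like
through Python's wrapper around the libc call (Unix only; the function
name current_charset is made up).  Note that it calls setlocale, so it
changes the process's LC_CTYPE as a side effect.

```python
import locale


def current_charset() -> str:
    """Return the charset name of the user's locale via nl_langinfo(CODESET)."""
    try:
        # Adopt the locale from the environment (LANG, LC_ALL, LC_CTYPE).
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # Unsupported locale settings: keep whatever locale is already active.
        pass
    return locale.nl_langinfo(locale.CODESET)
```

The returned name depends on the platform and environment, so it should
be normalised before comparing it against known charset names.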

> ad 5) is probably the Right thing to do.  It's just that Gabriel's "Worse
>       is Better" essay keeps echoing in my mind for some reason ;)

« Because it is the right thing, it has nearly 100% of desired
« functionality, and implementation simplicity was never a concern so it
« takes a long time to implement.  It is large and complex. »  Op. cit.

                                        Juliusz




