[darcs-users] Handling of non-ASCII characters in darcs.cgi

Ralf Muschall ralf at tecont.de
Thu Dec 29 08:29:09 UTC 2005


Working on a Perl program containing German umlauts in Latin-1
encoding, I got an empty "annotate" display (instead of
red/green/black source code) in my browser.

In the httpd logs I found this:

| -:450: error: Input is not proper UTF-8, indicate encoding !
| our @MONTH_ABBREVS=qw(Jan Feb Mär Apr Mai Jun  Jul Aug Sep Okt Nov Dez);

The message comes from xsltproc, which expects UTF-8 input and
rejects the Latin-1 bytes.

It might be bad programming style to have raw umlauts in one's
source, but for a versioning system it would be nice to survive
this somehow (after all, the characters might come from somebody
else's code).

I found a minimal workaround (patch below) which recodes the umlauts
as numeric character references, but unfortunately destroys UTF-8
sequences (a UTF-8 "ä" is two bytes, so it comes out as "Ã¤", etc.).

A somewhat smarter way would be to recognize valid UTF-8 sequences
and recode only the non-ASCII bytes which are not part of one.

*** darcs.cgi.old       2005-12-23 17:23:24.000000000 +0100
--- darcs.cgi   2005-12-29 09:20:08.000000000 +0100
***************
*** 116,121 ****
--- 116,122 ----

      seek ($xml, 0, 0);
      while (<$xml>) {
+       s/([\200-\377])/"\046\043" . ord($1) . ";"/ge;
        print $pipe $_;
      }
  }
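
The smarter variant mentioned above might look roughly like the
following sketch. It has not been tried inside darcs.cgi; the names
recode_line and $utf8_seq are made up for illustration, and it assumes
the XML is read as raw bytes (no I/O layers on the filehandle). Bytes
that form a well-formed UTF-8 sequence are passed through, and any other
high byte is treated as Latin-1 and turned into a numeric character
reference. This is only a heuristic: two Latin-1 characters that happen
to look like a UTF-8 sequence (e.g. "Ã" followed by "¤") would be left
alone.

```perl
use strict;
use warnings;

# One well-formed multi-byte UTF-8 sequence, matched against raw bytes
# (byte ranges as in RFC 3629; overlong forms and surrogates excluded).
my $utf8_seq = qr/
      [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# Hypothetical helper, not part of darcs.cgi.
sub recode_line {
    my ($line) = @_;
    # Try a whole UTF-8 sequence first and keep it unchanged; otherwise
    # recode the single stray high byte as a character reference.
    $line =~ s{($utf8_seq)|([\x80-\xFF])}
              {defined $1 ? $1 : '&#' . ord($2) . ';'}ge;
    return $line;
}
```

In the loop from the patch, the added substitution line would then
become something like `$_ = recode_line($_);`.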

Is there a cleaner way to keep it working with evil sources?

Ralf




