[darcs-users] Handling of non-ASCII characters in darcs.cgi
Ralf Muschall
ralf at tecont.de
Thu Dec 29 08:29:09 UTC 2005
Working on a Perl program containing German umlauts in Latin-1
encoding, I got an empty "annotate" display (instead of
red/green/black source code) in my browser.
In the httpd logs I found this:
| -:450: error: Input is not proper UTF-8, indicate encoding !
| our @MONTH_ABBREVS=qw(Jan Feb Mär Apr Mai Jun Jul Aug Sep Okt Nov Dez);
The message comes from xsltproc, which expects UTF-8 input here and
chokes on the Latin-1 bytes.
It might be bad programming style to have raw umlauts in one's
source, but for a versioning system it would be nice to survive
this somehow (after all, the characters might come from somebody
else's code).
I found a minimal workaround, shown in the diff below. It recodes the
umlauts as numeric character references, but unfortunately destroys
UTF-8 sequences, because it works byte by byte (a UTF-8 "ä" comes
out as "Ã¤", etc.).
A somewhat smarter way would be to recognize valid UTF-8 sequences
and recode only the non-ASCII bytes which don't match.
*** darcs.cgi.old 2005-12-23 17:23:24.000000000 +0100
--- darcs.cgi 2005-12-29 09:20:08.000000000 +0100
***************
*** 116,121 ****
--- 116,122 ----
seek ($xml, 0, 0);
while (<$xml>) {
+ s/([\200-\377])/"\046\043" . ord($1) . ";"/ge;
print $pipe $_;
}
}
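A rough sketch of the smarter variant mentioned above, not tested
against darcs.cgi itself: leave well-formed UTF-8 multi-byte sequences
alone and turn only the remaining high bytes into numeric character
references. The helper name recode_line and the pattern variable
$utf8_seq are made up for this illustration; the byte ranges follow
the usual definition of well-formed UTF-8, and the input is assumed to
be a plain byte string (no encoding layer on the file handle).

```perl
use strict;
use warnings;

# One well-formed UTF-8 multi-byte sequence (2 to 4 bytes).
# Overlong forms and surrogates are deliberately excluded.
my $utf8_seq = qr/
      [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# Hypothetical helper, not part of darcs.cgi.
# Takes a byte string and returns a copy in which valid UTF-8
# sequences are kept as they are and every other byte >= 0x80 is
# replaced by "&#NNN;" (its Latin-1 value, which equals its Unicode
# code point).
sub recode_line {
    my ($line) = @_;
    $line =~ s{($utf8_seq)|([\x80-\xFF])}
              {defined($1) ? $1 : '&#' . ord($2) . ';'}ge;
    return $line;
}

# Assumed use inside the existing loop of the patch:
#     print $pipe recode_line($_);
```

This is only a heuristic: a Latin-1 text that happens to contain a
byte pair which is also valid UTF-8 (for example the two characters
"Ã¤") would be passed through as UTF-8.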
Is there a cleaner way to keep it working with evil sources?
Ralf