[darcs-users] [darcs-devel] [patch639] Use utf-8 charset for darcs send in case of non-ascii ...

Stephen J. Turnbull stephen at xemacs.org
Sun Jul 10 09:22:31 UTC 2011


Eric Kow writes:

 > I suggested on the patch tracker [1] that we do the same thing we do
 > when reading patch metadata from the user, guess and fall back: try
 > decoding as UTF-8 and if there are errors doing so, decode as locale.
 > Would that be good?

I don't know.  I've been dealing with a truly insane coding situation
(aka, I live in Japan, where there are FIVE major encodings in daily
use) for twenty years.  I don't have such problems any more, not twice
with the same software.  I walk softly and carry a big `rm'.  ;-)

It probably will work pretty well, in theory, though.  The signature
of a UTF-8 octet stream is very distinctive.
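
For concreteness, the guess-and-fall-back idea could be sketched roughly as follows. This is only an illustration in Python, not Darcs code: the function name `decode_guess` and its `fallback` parameter are made up for the example, and the locale lookup assumes `locale.getpreferredencoding` reflects what the user actually types.

```python
import locale
from typing import Optional, Tuple


def decode_guess(data: bytes, fallback: Optional[str] = None) -> Tuple[str, str]:
    """Try strict UTF-8 first; if that fails, decode with the fallback encoding.

    Returns the decoded text together with the name of the encoding used.
    `fallback` is a hypothetical parameter; when omitted, the locale's
    preferred encoding is used.
    """
    try:
        # errors="strict" rejects any malformed sequence instead of guessing.
        return data.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        enc = fallback or locale.getpreferredencoding(False)
        # Undecodable bytes become U+FFFD rather than raising a second time.
        return data.decode(enc, errors="replace"), enc
```

Returning the encoding alongside the text lets the caller record which guess was made, which helps when a human has to untangle a wrong guess later.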

The big problem that you face is short sequences of extended Shift
JIS, Big 5, and Windows-125x that are mostly ASCII.  That sounds a lot
like a typical email message with correctly spelled name and/or .sig
to me.  The problem is that all of these encodings can present an
occasional short sequence that has the characteristic first octet in
C2-DF, second octet in 80-BF pattern of 2-octet UTF-8 characters (they
can also present longer sequences, but those are rather unlikely in
real text as far as I know).
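
A small sketch of that ambiguity, using Windows-1252 because it is easy to reproduce. The helper name `looks_like_utf8` is invented for the example, and the two-character string is deliberately contrived; the point is only that a legacy-encoded pair of octets can pass a strict UTF-8 check.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes are well-formed UTF-8 (strict check)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# U+00C3 followed by U+00A9 encodes in Windows-1252 as the octets C3 A9,
# which is also the well-formed UTF-8 encoding of U+00E9.
ambiguous = "\u00c3\u00a9".encode("cp1252")

# A more typical case: a lone accented letter surrounded by ASCII.
# E9 would start a 3-octet UTF-8 sequence, but no continuation octets follow.
typical = "Jos\u00e9".encode("cp1252")

print(looks_like_utf8(ambiguous))  # the UTF-8 guess wrongly succeeds here
print(looks_like_utf8(typical))    # the UTF-8 guess correctly fails here
```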

You should be careful to insist on correctly formatted UTF-8.  For
example, just following the decoding algorithm blindly, hex "C0 B0"
would decode to hex "30", ie, the ASCII code for the digit "0".  If
you are strict about this, then anything decoded and stored will
round-trip exactly as a stream of bytes, so it will be possible to
recover the original bytes, and probably a human can figure out which
encoding was meant originally (by cut and try with iconv, if nothing
else).
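
To make the overlong-form point concrete, here is a rough Python sketch. `naive_two_octet` is a made-up helper that applies the 2-octet bit arithmetic without any validity checks; Python's built-in strict decoder stands in for "correctly formatted UTF-8".

```python
def naive_two_octet(b1: int, b2: int) -> int:
    """Blindly apply the 2-octet UTF-8 formula, with no validity checks."""
    return ((b1 & 0x1F) << 6) | (b2 & 0x3F)


def strict_utf8(data: bytes):
    """Return the decoded text, or None if the bytes are not well-formed UTF-8."""
    try:
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


overlong = bytes([0xC0, 0xB0])

# The blind formula yields 0x30, an ASCII character hiding in a 2-octet form.
print(hex(naive_two_octet(0xC0, 0xB0)))

# A strict decoder refuses the overlong form outright.
print(strict_utf8(overlong))

# Anything a strict decoder accepts re-encodes to the identical octets.
sample = "na\u00efve \u65e5\u672c\u8a9e".encode("utf-8")
assert strict_utf8(sample).encode("utf-8") == sample
```

If the blind decoder were used instead, both "30" and "C0 B0" would be stored as the same character, and the original octets could no longer be recovered.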

 > The sort of experience and user advocacy that Stephen and Dan provide is
 > valuable to the project,

Well, in the cool light of the morning, it's not clear to me that such
strong reactions are entirely justified.  But my immediate reaction was
"I just wouldn't want anybody writing anything like

    http://www.jwz.org/doc/cadt.html

about Darcs." :-)


