[darcs-users] [issue1143] darcs changes --xml is not consistently encoded

Sun Oct 12 04:09:31 UTC 2008

Trent Buck <trentbuck at gmail.com> added the comment:

On Sat, Oct 11, 2008 at 09:04:14PM -0000, Reinier Lamers wrote:
> I believe I read in a mailing list thread that darcs can't use a
> consistent encoding for metadata, because it uses the metadata for
> hashing precisely (bit-by-bit) as it got it from the operating
> system.

AIUI darcs currently treats everything as byte vectors.  This is fine
as long as everyone uses the same character set and encoding.
Unfortunately, while that might be true for small groups, it's not
true for large, international projects like Darcs itself.

Any existing patches recorded by darcs have lost essential
information: the encoding of the metadata.  It's impossible to get
this back reliably.  So we have two separate issues:

- We need a workaround in order to work with existing multi-encoding
  repositories, including Darcs' own repo.  As you say below, probably
  the best we can do is just throw away any non-ASCII characters :-(

- We need to prevent this from happening in future, by either

  0) forcing everyone to use UTF-8.  I think we can just dismiss this
     as impossible, if only because of Japan.

  1) recording the metadata coding as part of the metadata (as done by
     MIME for email); or

  2) by standardizing on a single coding for internal use (that is,
     within the actual patches in _darcs), and converting all user
     input to that coding.

     The encoding used internally isn't particularly important, but
     obvious candidates are UTF-8, UTF-16, Unicode codepoint sequences
     and ISO 10646.  Since UTF-8 has useful properties for

  In both (1) and (2) we need to convert output to the user's
  coding, with some kind of sensible behaviour when that's not
  possible (e.g. user is using ISO 8859-1 and the patch author's name
  contains Greek characters).  The iconv(1) tool might be useful as an
  example of handling such lossy recoding.
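
To make option (2) and the lossy output step concrete, here is a
minimal Python sketch.  It is only an illustration, not how darcs
works: the helper names (to_internal, to_user) are hypothetical,
UTF-8 is assumed as the internal coding, and replacing unrepresentable
characters with '?' is just one possible policy (iconv -c drops them
instead).

```python
def to_internal(raw: bytes, declared_encoding: str) -> bytes:
    """Hypothetical helper: normalise user input to the internal coding.

    Assumes we somehow know the coding the user typed in (e.g. from
    the locale); UTF-8 is the assumed internal coding.
    """
    return raw.decode(declared_encoding).encode("utf-8")


def to_user(internal: bytes, user_encoding: str) -> bytes:
    """Hypothetical helper: recode internal metadata for display.

    Characters the user's coding cannot represent become '?' rather
    than raising an error.
    """
    return internal.decode("utf-8").encode(user_encoding, errors="replace")


# A made-up Greek author name, typed by a user whose terminal uses ISO 8859-7.
greek_input = "Νίκος".encode("iso-8859-7")

stored = to_internal(greek_input, "iso-8859-7")   # what would live in the patch
shown_latin1 = to_user(stored, "iso-8859-1")      # a Latin-1 user sees '?' marks
shown_greek = to_user(stored, "iso-8859-7")       # a Greek user gets it back intact
```

The point is that the loss happens only at display time, for the one
user whose coding can't represent the name; the stored patch keeps the
full information.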

> Perhaps we can put a declaration in the XML that the encoding is
> iso-8859-1 (aka latin1)? There is no such thing as invalid
> iso-8859-1, and most data in ASCII-based encoding will look
> reasonable in iso-8859-1.

While this might work around the immediate issue, it is not a long
term solution.  If you forcibly treat the entire byte vector as some
ASCII-compatible eight-bit encoding (e.g. ISO 8859-1), you will
silently(!) get gibberish for

- any non-ASCII character in all other ASCII-compatible codings,
  including UTF-8 and the other ISO 8859 parts; and

- *ALL* characters in ASCII-incompatible codings, including the
  popular UTF-16 and JIS.
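
A small Python sketch of both failure modes, using a name that appears
in the Darcs history as sample data.  Note that neither decode raises
an error, which is exactly the "silently" part:

```python
name = "Daniel Bünzli"

# UTF-8 bytes misread as ISO 8859-1: ASCII survives, 'ü' turns into
# two unrelated Latin-1 characters.
utf8_as_latin1 = name.encode("utf-8").decode("iso-8859-1")

# UTF-16 (little-endian, no BOM) misread as ISO 8859-1: every
# character, ASCII included, is interleaved with NUL bytes.
utf16_as_latin1 = name.encode("utf-16-le").decode("iso-8859-1")

print(repr(utf8_as_latin1))
print(repr(utf16_as_latin1))
```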

----------------------------------------------------------------------

As a real-world case study, I compared the Darcs repo's metadata with
and without invalid UTF-8 characters:

    darcs changes --xml >/tmp/x
    darcs changes --xml | iconv -c -f utf-8 -t utf-8 >/tmp/y
    diff -u /tmp/[xy]

It appears that 'Daniel Bünzli' is using Latin-1 and every other
contributor is using UTF-8, or is using an encoding that happens to
silently convert to gibberish when treated as UTF-8.

If we treat everything as pure ASCII, we can see that there are only
two more cases -- one use of UTF-8 smart quotes, and one UTF-8 ú.

    darcs changes --xml >/tmp/x
    darcs changes --xml | iconv -c -f ascii -t utf-8 >/tmp/y
    diff -u /tmp/[xy]
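
For anyone without iconv handy, the two filters above can be
approximated in Python.  This is a rough sketch: drop_undecodable is a
hypothetical helper, the byte strings are made-up samples rather than
real darcs output, and Python's errors="ignore" may not drop exactly
the same bytes as iconv -c in every edge case.

```python
def drop_undecodable(data: bytes, codec: str) -> str:
    """Decode data as codec, silently discarding bytes that don't fit.

    Roughly what `iconv -c -f CODEC` does to its input.
    """
    return data.decode(codec, errors="ignore")


latin1_sample = "Bünzli".encode("iso-8859-1")   # b"B\xfcnzli", invalid as UTF-8
utf8_sample = "a\u00fab".encode("utf-8")        # 'ú' stored as two bytes

# Treated as UTF-8: the Latin-1 byte is dropped, the UTF-8 text survives.
latin1_via_utf8 = drop_undecodable(latin1_sample, "utf-8")
utf8_via_utf8 = drop_undecodable(utf8_sample, "utf-8")

# Treated as ASCII: every non-ASCII byte is dropped, whatever its coding.
utf8_via_ascii = drop_undecodable(utf8_sample, "ascii")
```

Comparing the input against the filtered output (as diff does above)
then shows which entries contained bytes outside the assumed coding.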

__________________________________
Darcs bug tracker <bugs at darcs.net>
<http://bugs.darcs.net/issue1143>
__________________________________