[darcs-users] Re: XML format

Alexander Staubo alex at byzantine.no
Thu Dec 16 18:40:46 UTC 2004


Thomas Zander wrote:
>><whatsnew>
>>   <patch file="foo/bar">
>>     <hunk line="5">
>>       <delete>Lorem ipsum dolor &quot;sit</delete>
>>       <add>amet, consectetuer adipiscing elit</add>
>>     </hunk>
>>   </patch>
>></whatsnew>
> 
> Much better than mine :)
> You made 1 mistake, though.  The </delete> needs to be on the next line 
> since this way you forgot all the line endings.  So as long as darcs is 
> per-line based, you need to add the exact \n and/or \r in the xml.

You're right -- that was just an example. :)
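One caveat about putting the exact \r in the XML: as far as I know, a conforming XML parser normalizes line endings in element content, so a literal CR LF comes back as a bare LF. A carriage return that must survive has to be written as a character reference. A minimal sketch, assuming a modern Python 3 and its standard-library ElementTree parser (the <delete> element is just the one from the example above):

```python
import xml.etree.ElementTree as ET

# A literal CR LF inside element content is normalized to LF by the parser.
raw = ET.fromstring(b"<delete>Lorem ipsum\r\n</delete>")

# Writing the carriage return as a character reference keeps it intact.
escaped = ET.fromstring(b"<delete>Lorem ipsum&#13;\n</delete>")

print(repr(raw.text))      # 'Lorem ipsum\n'
print(repr(escaped.text))  # 'Lorem ipsum\r\n'
```

So a line-based XML format would probably need to escape \r explicitly rather than rely on it passing through untouched.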

>>The Darcs format also has no concept of encodings, which is a typical
>>Unix problem
> 
> 'problem' is a bit overstated IMO.  The fact that everything will be parsed 
> and seen as latin1 avoids many conversions which end up being unneeded 
> anyway (since darcs does not actually _use_ the text).
> This will surely be different if darcs chooses to not be line based anymore. 
> More below.

It's a problem once you start moving text into encoding-aware applications.

As far as I know, the "latin1" assumption is false. It's actually ASCII, 
with every byte that has the 8th bit set left in an undefined encoding.

For example, let's say I'm writing a web front end which displays 
patches. This being XHTML, I prefer UTF-8 for my page. Now, my program 
invokes Darcs, reads its output, blatantly assuming it's ISO-8859-1. Now 
I include the patch contents in my XHTML output, which involves 
converting the contents to UTF-8.

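This will work fine if the original patch was ISO-8859-1, because the 
transformation will be valid. But if the original was already UTF-8, the 
transformation silently produces garbage: decoding as ISO-8859-1 never 
fails, so every non-ASCII character simply ends up encoded twice.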

Here's a valid transformation:

 >>> s = b"caf\xe9"  # "café" as ISO-8859-1 bytes
 >>> s = s.decode("iso-8859-1")  # decode to a Unicode string
 >>> s
'café'
 >>> s = s.encode("utf-8")  # encode as UTF-8
 >>> s
b'caf\xc3\xa9'

Here's the same ISO-8859-1->UTF-8 conversion applied to a string that is 
already UTF-8:

 >>> s = b"caf\xc3\xa9"  # "café" as UTF-8 bytes
 >>> s = s.decode("iso-8859-1")  # wrongly decoded as ISO-8859-1
 >>> s
'cafÃ©'
 >>> s = s.encode("utf-8")  # encode as UTF-8 (a second time)
 >>> s
b'caf\xc3\x83\xc2\xa9'

Here's an observation. The above problem does not arise in non-XML 
programs, because there the text's encoding is simply undefined. Either 
it must be supplied as additional metadata every time you process the 
data, or, for programs (such as, say, grep) that can assume a 
character == octet relationship, it never needs to be known at all. But 
once you enter the world of XML, you *must* specify the encoding, because 
in XML text is a first-class citizen: a character is an abstract thing 
fitting into the Unicode space, and there's no such thing as a neutral, 
byte-oriented encoding.
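To make that concrete, here is a small sketch using a modern Python 3 and its standard-library ElementTree parser (my choice for illustration, nothing Darcs-specific). Two documents with different bytes but correct encoding declarations yield the same characters, while a document whose declaration does not match its bytes is rejected outright:

```python
import xml.etree.ElementTree as ET

# Same text, two encodings, each declared correctly.
latin1_doc = b'<?xml version="1.0" encoding="iso-8859-1"?><add>caf\xe9</add>'
utf8_doc = b'<?xml version="1.0" encoding="utf-8"?><add>caf\xc3\xa9</add>'

latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text
same = latin1_text == utf8_text  # both parse to the same characters

# ISO-8859-1 bytes mislabeled as UTF-8 are not well-formed XML.
mislabeled = b'<?xml version="1.0" encoding="utf-8"?><add>caf\xe9</add>'
try:
    ET.fromstring(mislabeled)
    rejected = False
except ET.ParseError:
    rejected = True

print(same, rejected)  # True True
```

The point is that the parser never hands back raw bytes; it either produces characters or refuses the document.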

>>with XML this problem doesn't automagically go away, but 
>>becomes easier to deal with as encoding metadata can be preserved and
>>described in a standard manner; if my document begins,
>>
>>   <?xml version="1.0" encoding="utf-8"?>
>>
>>then your parser had damn well better treat the document as UTF-8.
> 
> That's true; but it assumes that darcs knows how to read the user-provided 
> file correctly, i.e. knows the encoding of each file in the repo.  Which is 
> a problem that has not yet been addressed.

I think it's a problem that needs to be handled.

Fortunately, from the user point of view it should be easy and largely 
transparent:

   # darcs add foo.c --encoding utf-8

Darcs should be able to autodetect encodings based on some simple 
heuristics, such as the UTF-8 BOM signature and the encoding named in an 
XML declaration (<?xml version="1.0" encoding="foo"?>).
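A rough sketch of what such heuristics could look like, in modern Python 3. The function name guess_encoding and the fallback order are my own assumptions for illustration; none of this is existing Darcs behaviour, and a None result means the user would still have to say what the encoding is:

```python
import codecs
import re

# Matches the encoding pseudo-attribute of a leading XML declaration.
XML_DECL = re.compile(
    rb'^<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def guess_encoding(data):
    """Return a best-guess encoding name for the bytes in data, or None."""
    # Heuristic 1: the UTF-8 byte order mark.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # Heuristic 2: an explicit XML declaration.
    match = XML_DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # Heuristic 3: pure ASCII, then well-formed UTF-8.
    for name in ("ascii", "utf-8"):
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            pass
    # Some 8-bit encoding we cannot identify; do not guess.
    return None
```

For example, guess_encoding(b"caf\xc3\xa9") gives "utf-8", while guess_encoding(b"caf\xe9") gives None rather than pretending the bytes are Latin 1.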

How does Subversion solve this problem? I'd assume using encoding metadata.

Doesn't POSIX/glibc locales also provide an encoding system?

> Until that has happened, the advantage is null.  Then again, it _is_ easy to 
> assume everything on disk is latin1, and place everything in the XML as 
> utf8, which delays the conversion until that is implemented in a forward 
> compatible manner.

I think assuming anything is Latin 1 is a mistake, because the 
transformations on the resultant XML won't be consistent. Aren't there 
any Russian or Japanese Darcs users here to help argue my case? :)

Alexander.



