[darcs-users] Escaping of hunks and file names

Sun Nov 7 20:01:11 UTC 2004

David Roundy wrote:
> On Fri, Nov 05, 2004 at 07:08:14AM +0100, Alexander Staubo wrote:
> 
>>Is there a reason why Darcs escapes file names and text differently?
> 
> First off, darcs only escapes either when it sees that it's outputting to a
> terminal, so it shouldn't affect scripting.

By terminal, do you mean that this:

$ darcs what >foo

should not escape the output to foo?

That's not what's happening here. Darcs 1) escapes text and file names, and, 
oddly enough, 2) colourizes the diffs (though it only colourizes the 
"addfile", etc. keywords when outputting to the terminal):

$ python
Python 2.3.4 (#2, Sep 24 2004, 08:39:09)
[GCC 3.3.4 (Debian 1:3.3.4-12)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
 >>> import os
 >>> inp, out = os.popen2("darcs what")
 >>> out.read()
'{\naddfile ./with\\32\\space.txt\nhunk ./with\\32\\space.txt 
1\n+Hell\x1b[01;34m\\f8\x1b[00m\x1b[01;34m\\f8\x1b[00m!\n}\n'

As you can see, those are "ANSI" escape codes in there.

The above output is for a changeset with a file called "with space.txt" 
containing the string "Helløø!". I should think that the above 
Python-invoked command does not constitute "outputting to a terminal"?
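
For reference, the usual way a program decides whether it is "outputting 
to a terminal" is an isatty() check on its standard output. The sketch 
below is in Python 3 and is not taken from Darcs; it only shows that a 
child process whose output is captured through a pipe does not see a 
terminal. The CHILD one-liner and the child_sees_tty helper are 
illustrative names.

```python
import subprocess
import sys

# Child program: report whether its own stdout is attached to a terminal.
CHILD = "import sys; print(sys.stdout.isatty())"

def child_sees_tty():
    """Run the child with stdout captured through a pipe and return its answer."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("ascii").strip() == "True"

piped = child_sees_tty()
print(piped)  # False: a pipe is not a terminal
```

A program that wants colour only on a terminal would make the same check 
on its own stdout before emitting any escape codes.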

I know next to nothing about Unix terminal emulation, so forgive me if 
this is the expected behaviour. I hadn't noticed the colourization 
before, though.
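
Until that is sorted out, a script can strip the colour codes itself. A 
minimal sketch in Python 3, assuming the only sequences present are 
colour (SGR) sequences of the form ESC [ ... m, as in the output above; 
strip_ansi is an illustrative helper, not part of Darcs.

```python
import re

# Matches SGR (colour) sequences such as ESC[01;34m and ESC[00m.
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")

def strip_ansi(text):
    """Remove ANSI colour sequences, leaving everything else untouched."""
    return ANSI_SGR.sub("", text)

# The hunk line from the output above, including its colour codes.
sample = "+Hell\x1b[01;34m\\f8\x1b[00m\x1b[01;34m\\f8\x1b[00m!"
cleaned = strip_ansi(sample)
print(cleaned)  # +Hell\f8\f8!
```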

> File names are treated differently than file lines, because lines can be
> treated just as a sequence of bytes.  File names, in the haskell standard
> libraries, are treated as sequences of unicode characters.  Darcs follows
> this convention, and always encodes them as UTF-8.
> 
> Sometimes I think this was a mistake... I think that we should view
> filenames as being just a sequence of bytes, but this would mean that if we
> wanted forward-compatibility with haskell compilers we'd have to forgo
> using the haskell IO libraries.
> 
> In any case, it's pretty much irrelevant now, since I'm pretty sure there
> are people out there with their file names in their repositories encoded as
> UTF-8 instead of raw octets, and it's not really worth a repository format
> transition.

Outputting file names as UTF-8 is fine. However, why is Darcs escaping 
the UTF-8, and in such a non-standard (\yy\) format?

>>There is absolutely no need to escape anything in XML except "<", ">" 
>>and "&", and the escaping pollutes the format.
> 
> On the XML formatting, I defer to others, who know about such things...

I know a thing or two about XML :) Homegrown escaping techniques do not 
play well with XML -- instead of leveraging XML's powerful and 
formalized encoding support you're fighting it, and forcing additional 
(possibly error-prone) unescaping logic upon clients.

You can express almost any Unicode character in XML (a few control 
characters excepted), using its own escaping mechanism:

   <tag>Darcs is open&#8211;source software</tag>

However, XML handles unescaped Unicode (or UTF-8) just fine, as long as 
you declare the appropriate encoding at the beginning, e.g. <?xml 
version='1.0' encoding='utf-8'?>.
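
To make this concrete, here is a small sketch using Python 3's standard 
library (xml.sax.saxutils.escape and xml.etree.ElementTree). It shows 
that the stock escaping touches only "&", "<" and ">", and that a parser 
accepts both raw UTF-8 and a numeric character reference. The sample 
strings are made up for illustration and are not Darcs output.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# escape() rewrites only "&", "<" and ">" by default; "ø" passes through.
escaped = escape("a < b & helløø")

# A UTF-8 encoded document with an XML declaration and unescaped "ø".
doc = "<?xml version='1.0' encoding='utf-8'?><tag>helløø</tag>".encode("utf-8")
raw_text = ET.fromstring(doc).text

# The numeric character reference from the example above (U+2013, en dash).
ref_text = ET.fromstring("<tag>Darcs is open&#8211;source software</tag>").text

print(escaped == "a &lt; b &amp; helløø")                # True
print(raw_text == "helløø")                              # True
print(ref_text == "Darcs is open\u2013source software")  # True
```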

(As an XML user, I note that Darcs' output lacks both the XML header and 
a DTD doctype declaration -- which means I can't automatically validate 
it -- and you could do with a namespace declaration as well. I'm 
inclined to submit a couple of patches for this once I get the hang of 
this Haskell thang.)

>>I'm surprised that Darcs even escapes file names. However, in many 
>>places it *doesn't* escape anything:
> 
> I'd say (and many people would complain) that most darcs commands are
> *primarily* intended to be parsed by humans.  If you have crazy files with
> spaces in them, you need to be careful.  It's still parseable, since darcs
> formats things "predictably", it's just more of a pain.  But unless
> characters are non-printable, I don't see any reason to escape them.

Based on my previously reported findings, I would say that, strictly 
speaking, Darcs is formatting things predictably, but not consistently, 
and thus, to users and especially script-writing users, seemingly 
unpredictably; Darcs has three ways of outputting stuff: unescaped (as 
in "darcs changes -s"), hex-escaped (as for hunk text in "darcs 
whatsnew"), and decimal-escaped (as for file names in "darcs whatsnew"), 
and as a user I know that I'd 
have problems remembering which is happening where.
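
To illustrate what a client ends up writing, here is a sketch in Python 3 
of the two unescaping routines. The escape syntax is an assumption 
inferred solely from the one sample shown earlier in this message 
("\32\" in a file name, "\f8" in hunk text); it is not taken from the 
Darcs source or documentation, and it does not attempt to handle literal 
backslashes. Both function names are illustrative.

```python
import re

# ASSUMPTION: file names use "\<decimal>\", e.g. "\32\" for a space.
DECIMAL_ESCAPE = re.compile(r"\\(\d+)\\")
# ASSUMPTION: hunk text uses "\<two hex digits>", e.g. "\f8" for byte 0xF8.
HEX_ESCAPE = re.compile(rb"\\([0-9a-fA-F]{2})")

def unescape_file_name(name):
    """Replace each decimal escape with the character it stands for."""
    return DECIMAL_ESCAPE.sub(lambda m: chr(int(m.group(1))), name)

def unescape_hunk_text(data):
    """Replace each hex escape with the single byte it stands for."""
    return HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), data)

file_name = unescape_file_name("./with\\32\\space.txt")
hunk_bytes = unescape_hunk_text(b"+Hell\\f8\\f8!")

print(file_name)   # ./with space.txt
print(hunk_bytes)  # b'+Hell\xf8\xf8!'
```

The hunk text is kept as bytes because file contents are just a sequence 
of bytes; the caller still has to know (or guess) the file's encoding to 
turn them into characters.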

> Ideally, scripts should use the xml output, which *is* intended to be
> parsed by scripts.

Couldn't agree with you more. However, as far as I can see, only "darcs 
changes" and "darcs annotate" have XML output.

Significantly, "darcs whatsnew" does not do XML -- understandable as the 
patch format might be considered Darcs' *canonical* patch description 
language, but still damn awkward for scripts. (Try writing a LALR 
grammar some time for a line-oriented, extremely context-dependent 
format such as Darcs' -- I've got the hang of it now, but the idea of all 
that state juggling isn't conjuring up any butterflies in *my* belly. :)
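
As a rough illustration of that state juggling, here is a hand-written 
line-oriented parser in Python 3 (not an LALR grammar) for the tiny 
subset of the format visible in the sample earlier in this message: 
braces, "addfile", "hunk", and "+"/"-" lines. Anything beyond that 
subset is deliberately rejected, the input is assumed to have had its 
colour codes removed already, and parse_whatsnew is an illustrative 
name.

```python
def parse_whatsnew(text):
    """Parse the small subset of the patch format seen in the sample."""
    changes = []
    current_hunk = None  # the hunk that following "+" / "-" lines belong to
    for line in text.splitlines():
        if line in ("{", "}"):
            current_hunk = None
        elif line.startswith("addfile "):
            changes.append({"kind": "addfile", "path": line[len("addfile "):]})
            current_hunk = None
        elif line.startswith("hunk "):
            # "hunk <path> <first line number>"; split off the number from the right.
            path, start = line[len("hunk "):].rsplit(" ", 1)
            current_hunk = {"kind": "hunk", "path": path,
                            "start": int(start), "lines": []}
            changes.append(current_hunk)
        elif line[:1] in ("+", "-") and current_hunk is not None:
            # Only meaningful inside a hunk: this is the context dependence.
            current_hunk["lines"].append(line)
        else:
            raise ValueError("unrecognised line: %r" % line)
    return changes

sample = ("{\naddfile ./with\\32\\space.txt\n"
          "hunk ./with\\32\\space.txt 1\n+Hell\\f8\\f8!\n}\n")
parsed = parse_whatsnew(sample)
print(len(parsed))        # 2
print(parsed[1]["lines"])
```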

Again, I'm interested in adding this functionality if you agree that 
it's a good idea. Interestingly, the XML output of "whatsnew" would 
share elements with the existing XML output of "changes", to the point 
where you could say they share the same format. "changes" deals with 
persistent patches, "whatsnew" expresses *potential* patches, so the 
latter would not have metadata such as author, date, hash or name, but 
it would have, say, modify_file with a detailed hunk-style diff.

Speaking of output, Darcs also needs improvement when it comes to 
detecting error conditions. For example, "darcs add a-non-existent-file" 
will return with exit code 0, as will "darcs add a-file-already-added". 
A script could perceive the lack of messages as meaning success and 
everything else meaning error, but it's not exactly robust. One of the 
things my code needs to do is determine whether a file is recorded in 
the repository

Alexander.