[darcs-users] Escaping of hunks and file names
Alexander Staubo
alex at byzantine.no
Sun Nov 7 20:01:11 UTC 2004
David Roundy wrote:
> On Fri, Nov 05, 2004 at 07:08:14AM +0100, Alexander Staubo wrote:
>
>>Is there a reason why Darcs escapes file names and text differently?
>
> First off, darcs only escapes either when it sees that it's outputting to a
> terminal, so it shouldn't affect scripting.
By terminal, do you mean that this:
$ darcs what >foo
should not escape the output to foo?
That's not what's happening here. Darcs 1) escapes text and file names, and,
oddly enough, 2) colourizes the diffs (though it only colourizes the
"addfile", etc. keywords when outputting to the terminal):
$ python
Python 2.3.4 (#2, Sep 24 2004, 08:39:09)
[GCC 3.3.4 (Debian 1:3.3.4-12)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> inp, out = os.popen2("darcs what")
>>> out.read()
'{\naddfile ./with\\32\\space.txt\nhunk ./with\\32\\space.txt 1\n+Hell\x1b[01;34m\\f8\x1b[00m\x1b[01;34m\\f8\x1b[00m!\n}\n'
As you can see, those are "ANSI" escape codes in there.
The above output is for a changeset with a file called "with space.txt"
containing the string "Helløø!". I should think that the above
Python-invoked command does not constitute "outputting to a terminal"?
I know next to nothing about Unix terminal emulation, so forgive me if
this is the expected behaviour. I hadn't noticed the colourization
before, though.
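
For scripts that have to cope with this in the meantime, here is a minimal sketch (in current Python rather than the 2.3 session above) of two things: checking whether a stream is a terminal, and stripping colour codes from captured output. The names strip_ansi and is_terminal are made up for illustration, and the pattern only covers the "ESC [ ... letter" sequences visible in the output above, not every ANSI control sequence.

```python
import re
import sys

# Matches CSI sequences such as ESC[01;34m and ESC[00m (the kind seen above).
# Assumption: no other kinds of control sequence appear in the output.
ANSI_CSI = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")


def strip_ansi(text):
    """Remove ANSI CSI escape sequences from captured output."""
    return ANSI_CSI.sub("", text)


def is_terminal(stream=sys.stdout):
    """Return True only if the stream reports being attached to a terminal."""
    isatty = getattr(stream, "isatty", None)
    return bool(isatty and isatty())


# The hunk line from the session above, with its colour codes intact.
captured = "+Hell\x1b[01;34m\\f8\x1b[00m\x1b[01;34m\\f8\x1b[00m!"
cleaned = strip_ansi(captured)
```

A program that only colourizes when is_terminal() is true would leave pipes and redirected files free of escape codes, which is the behaviour a script would expect.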
> File names are treated differently than file lines, because lines can be
> treated just as a sequence of bytes. File names, in the haskell standard
> libraries, are treated as sequences of unicode characters. Darcs follows
> this convention, and always encodes them as UTF-8.
>
> Sometimes I think this was a mistake... I think that we should view
> filenames as being just a sequence of bytes, but this would mean that if we
> wanted forward-compatibility with haskell compilers we'd have to forgo
> using the haskell IO libraries.
>
> In any case, it's pretty much irrelevant now, since I'm pretty sure there
> are people out there with their file names in their repositories encoded as
> UTF-8 instead of raw octets, and it's not really worth a repository format
> transition.
Outputting file names as UTF-8 is fine. However, why is Darcs escaping
the UTF-8, and in such a non-standard (\yy\) format?
>>There is absolutely no need to escape anything in XML except "<", ">"
>>and "&", and the escaping pollutes the format.
>
> On the XML formatting, I defer to others, who know about such things...
I know a thing or two about XML :) Homegrown escaping techniques do not
play well with XML -- instead of leveraging XML's powerful and
formalized encoding support you're fighting it, and forcing additional
(possibly error-prone) unescaping logic upon clients.
You can express any Unicode character, or indeed any binary octet, in
XML, using its own escaping mechanism:
<tag>Darcs is open&#8211;source software</tag>
However, XML handles unescaped Unicode (or UTF-8) just fine, as long as
you declare the appropriate encoding at the beginning, e.g. <?xml
version='1.0' encoding='utf-8'?>.
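
To illustrate, here is a small sketch using Python's standard xml.etree.ElementTree. It shows that a parser resolves a numeric character reference by itself, and that a serializer escapes only the markup characters while leaving non-ASCII text alone. The element name "file" and the helper file_element_xml are hypothetical and are not part of Darcs' actual XML output.

```python
import xml.etree.ElementTree as ET

# A document with an XML declaration and a numeric character reference.
document = (
    "<?xml version='1.0' encoding='utf-8'?>"
    "<tag>Darcs is open&#8211;source software</tag>"
).encode("utf-8")

# The parser resolves &#8211; to U+2013 without any client-side unescaping.
parsed_text = ET.fromstring(document).text


def file_element_xml(name):
    """Serialize a file name as the text of a hypothetical <file> element."""
    element = ET.Element("file")
    element.text = name
    # Only &, < and > are escaped; the non-ASCII character passes through.
    return ET.tostring(element, encoding="unicode")


serialized = file_element_xml("a & b <ø>.txt")
round_tripped = ET.fromstring(serialized).text
```

The round trip needs no escaping rules beyond those the XML library already implements.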
(As an XML user, I note that Darcs' output lacks both the XML header and
a DTD doctype declaration -- which means I can't automatically validate
it -- and you could do with a namespace declaration as well. I'm
inclined to submit a couple of patches for this once I've got the hang of
this Haskell thang.)
>>I'm surprised that Darcs even escapes file names. However, in many
>>places it *doesn't* escape anything:
>
> I'd say (and many people would complain) that most darcs commands are
> *primarily* intended to be parsed by humans. If you have crazy files with
> spaces in them, you need to be careful. It's still parseable, since darcs
> formats things "predictably", it's just more of a pain. But unless
> characters are non-printable, I don't see any reason to escape them.
Based on my previously reported findings, I would say that, strictly
speaking, Darcs is formatting things predictably, but not consistently,
and thus, to users and especially script-writing users, seemingly
unpredictably; Darcs has three ways of outputting stuff: unescaped (as
in "darcs changes -s"), hex-escaped (as in the hunk text of "darcs whatsnew"), and
decimal-escaped (as in the file names of "darcs whatsnew"), and as a user I know that I'd
have problems remembering which is happening where.
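
As a rough sketch of the unescaping logic a client ends up writing, the following decodes the two escaped forms. The rules are inferred solely from the one sample output earlier in this message, so they are assumptions: file names use a decimal code between two backslashes (\32\), and hunk text uses a backslash followed by two hex digits (\f8). How a literal backslash is represented is unknown and not handled. Both function names are made up.

```python
import re

# Assumed file-name escape: backslash, decimal code, backslash (e.g. \32\).
DECIMAL_ESCAPE = re.compile(r"\\(\d+)\\")
# Assumed text escape: backslash followed by two hex digits (e.g. \f8).
HEX_ESCAPE = re.compile(r"\\([0-9a-fA-F]{2})")


def unescape_file_name(escaped):
    """Replace each \\NNN\\ decimal escape with the character it encodes."""
    return DECIMAL_ESCAPE.sub(lambda m: chr(int(m.group(1))), escaped)


def unescape_text(escaped):
    """Replace each \\xx hex escape with the character of that code.

    Mapping the byte value straight to a code point amounts to treating
    the escaped bytes as Latin-1, which is an assumption.
    """
    return HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), escaped)


file_name = unescape_file_name("./with\\32\\space.txt")
hunk_text = unescape_text("Hell\\f8\\f8!")
```

Having to keep two decoders, and to know which one applies where, is exactly the burden on script writers described above.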
> Ideally, scripts should use the xml output, which *is* intended to be
> parsed by scripts.
Couldn't agree with you more. However, as far as I can see, only "darcs
changes" and "darcs annotate" have XML output.
Significantly, "darcs whatsnew" does not do XML -- understandable as the
patch format might be considered Darcs' *canonical* patch description
language, but still damn awkward for scripts. (Try writing a LALR
grammar some time for a line-oriented, extremely context-dependent
format such as Darcs' -- I've got the hang of it now, but the idea of all
that state juggling isn't conjuring up any butterflies in *my* belly. :)
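
To give a flavour of the state juggling, here is a deliberately tiny sketch of a line-oriented parser. It only understands the constructs visible in the sample output earlier in this message ("{", "}", "addfile", "hunk", and "+"/"-" lines) and assumes colour codes have already been removed. It is nowhere near a full grammar for the patch format, and the dictionary layout is invented for illustration.

```python
def parse_whatsnew(text):
    """Parse the small subset of whatsnew output shown in this thread."""
    changes = []
    current_hunk = None  # the hunk that "+"/"-" lines currently belong to
    for line in text.splitlines():
        if line in ("{", "}"):
            current_hunk = None
        elif line.startswith("addfile "):
            current_hunk = None
            changes.append({"kind": "addfile", "path": line[len("addfile "):]})
        elif line.startswith("hunk "):
            # Paths are escaped, so the last space separates the line number.
            path, _, start = line[len("hunk "):].rpartition(" ")
            current_hunk = {"kind": "hunk", "path": path,
                            "start": int(start), "added": [], "removed": []}
            changes.append(current_hunk)
        elif current_hunk is not None and line.startswith(("+", "-")):
            key = "added" if line[0] == "+" else "removed"
            current_hunk[key].append(line[1:])
        else:
            raise ValueError("unrecognised line: %r" % line)
    return changes


sample = ("{\naddfile ./with\\32\\space.txt\n"
          "hunk ./with\\32\\space.txt 1\n+Hell\\f8\\f8!\n}\n")
parsed = parse_whatsnew(sample)
```

Even this toy version has to remember which hunk it is inside in order to interpret a line, which is the context dependence that makes the format awkward for scripts.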
Again, I'm interested in adding this functionality if you agree that
it's a good idea. Interestingly, the XML output of "whatsnew" would
share elements with the existing XML output of "changes", to the point
where you could say they share the same format. "changes" deals with
persistent patches, "whatsnew" expresses *potential* patches, so the
latter would not have metadata such as author, date, hash or name, but
it would have, say, modify_file with a detailed hunk-style diff.
Speaking of output, Darcs also needs improvement when it comes to
detecting error conditions. For example, "darcs add a-non-existent-file"
will return with exit code 0, as will "darcs add a-file-already-added".
A script could perceive the lack of messages as meaning success and
everything else meaning error, but it's not exactly robust. One of the
things my code needs to do is determine whether a file is recorded in
the repository.
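
The "no messages means success" workaround can be sketched as below, assuming Python 3.7 or later. The helper name run_quietly is made up, and treating any output on either stream as failure is a heuristic, not documented Darcs behaviour. The demonstration runs the Python interpreter itself so that it works without Darcs; a real script would pass something like ["darcs", "add", path].

```python
import subprocess
import sys


def run_quietly(argv):
    """Run a command and report success only if it exits 0 and prints nothing.

    Heuristic: any output on stdout or stderr is treated as a failure,
    because the exit code alone is not assumed to be reliable.
    """
    proc = subprocess.run(argv, capture_output=True, text=True)
    silent = proc.stdout.strip() == "" and proc.stderr.strip() == ""
    return proc.returncode == 0 and silent, proc.stdout, proc.stderr


# Stand-ins for a silent success, a "successful" command that complains,
# and a command that fails with a proper exit code.
ok_silent, _, _ = run_quietly([sys.executable, "-c", "pass"])
ok_noisy, noisy_out, _ = run_quietly(
    [sys.executable, "-c", "print('already added')"])
ok_failed, _, _ = run_quietly(
    [sys.executable, "-c", "import sys; sys.exit(2)"])
```

This is fragile: it breaks as soon as a command prints a harmless informational message, which is why proper non-zero exit codes would be the better fix.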
Alexander.