[darcs-users] Escaping of hunks and file names

David Roundy droundy at abridgegame.org
Thu Nov 11 12:42:53 UTC 2004


On Mon, Nov 08, 2004 at 05:03:33PM +0100, Alexander Staubo wrote:
> David Roundy wrote:
> [snip]
> >>I know next to nothing about Unix terminal emulation, so forgive me if 
> >>this is the expected behaviour. I hadn't noticed the colourization 
> >>before, though.
> >
> >No, this isn't obvious.  Oddly enough, it seems that perl does the same
> >thing.  I don't know what the haskell standard library "isTerminal"
> >function checks, but apparently when these languages call external
> >programs, they somehow are able to trick haskell into thinking it's in a
> >terminal.  :(
> 
> Given the lack of general Unix support for Haskell's "isTerminal", I see 
> two options here:

It could be that I'm wrong, and darcs isn't calling isTerminal when it
ought to be... I haven't looked at that code recently, and I'm not the only
one who has touched it.

> 1) Figure out how to fix the terminal check. Clearly other programs ("ls 
> isatty(fd). Is it easy to call arbitrary C library functions from 
> Haskell? Can you get at the output stream's file descriptor?

Hmmm.  It's a bit of a pain to get the output stream's file descriptor, but
the standard streams ought always to have the same descriptors (0, 1, 2),
right?  Calling C functions is easy, but making sure that they exist requires
autoconf work, which is also a bit of a pain, so I prefer using the standard
Haskell functions.
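
For illustration only, here is a minimal sketch of a terminal check that stays
within the standard libraries.  It assumes the hIsTerminalDevice function
exported by System.IO in current GHC base; the names
standardStreamsAreTerminals and shouldEscapeOutput are made up for this
example and are not darcs functions.

```haskell
import System.IO (hIsTerminalDevice, stderr, stdin, stdout)

-- | Report, for each standard stream (descriptors 0, 1 and 2), whether it
-- is attached to a terminal.
standardStreamsAreTerminals :: IO [(String, Bool)]
standardStreamsAreTerminals =
  mapM check [("stdin", stdin), ("stdout", stdout), ("stderr", stderr)]
  where
    check (name, h) = do
      isTty <- hIsTerminalDevice h
      return (name, isTty)

-- | Hypothetical policy: escape and colourize only when stdout is a
-- terminal, and write output verbatim otherwise.
shouldEscapeOutput :: IO Bool
shouldEscapeOutput = hIsTerminalDevice stdout
```

Run from a script with stdout piped, shouldEscapeOutput should come back
False, provided the calling program really does hand over a pipe rather than
a pseudo-terminal.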

> 2) Add something like a --non-terminal option to all of Darcs' commands, 
> allowing one to force the desired behaviour.

This would be fine, but only as a last resort, I'd say.

> >The hex escaped are how things show up on terminals--it's an attempt to
> >keep from messing up the terminal configuration by displaying escape
> >characters (except for color codes that are intentional).  On a terminal,
> >the hex escaped characters always show up blue...
> >
> >If darcs isn't in a terminal, it never should escape.
> 
> (Btw, possible bug: "darcs annotate" does not do the to-terminal hex 
> escaping, ever.)

Hmmm.  Which annotate are you referring to?  darcs annotate -p . certainly
outputs the patches in their colorized form...

> >>Outputting file names as UTF-8 is fine. However, why is Darcs escaping 
> >>the UTF-8, and in such a non-standard (\yy\) format?
> >
> >Only whitespace (and backslashes) are escaped in that format, and the
> >stupid format is because that is what I came up with when I was coding this
> >ages back.  Technically, only spaces and newlines actually need to be
> >escaped, since they would mess up darcs' parsing of patches--tabs and
> >carriage returns aren't used in darcs patch format as delimiters.
> >
> >Basically, I didn't put much thought into it, since at the time I was
> >thinking it wouldn't often come into play, since I consider white space in
> >filenames a bad idea, and backslashes in filenames also don't greatly
> >enhance the portability of your code.
> 
> Would you be willing, at this stage, to move to a more Unixy escaping 
> syntax? The principle of least surprise etc. When people all over start 
> writing scripts, it's going to be one of Darcs "little warts", I think, 
> that people complain about.

If it could be done in a fully backward-compatible manner, I wouldn't mind.
I really don't want to have another repository format transition.  They're
a royal pain.  I suspect a compatible change is doable.  What kind of
escape syntax are you envisioning?  Of course, more than one person would
have to agree that the new escaping is an improvement.
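
To make the discussion concrete, here is a rough sketch of the current style
of escaping.  It assumes that the "yy" in \yy\ is the decimal character code,
and it is not the actual darcs code; escapeName and unescapeName are names
invented for this example.  Any new syntax would presumably need a reader
that still accepts this form.

```haskell
import Data.Char (chr, isDigit, isSpace, ord)

-- | Escape whitespace and backslashes as \N\, where N is the decimal
-- character code (an assumption about what "yy" stands for).
escapeName :: String -> String
escapeName = concatMap esc
  where
    esc c
      | isSpace c || c == '\\' = "\\" ++ show (ord c) ++ "\\"
      | otherwise              = [c]

-- | Inverse of 'escapeName'.  A backslash that does not start a
-- well-formed \N\ escape is passed through unchanged.
unescapeName :: String -> String
unescapeName ('\\' : rest)
  | (ds@(_ : _), '\\' : more) <- span isDigit rest
  , length ds <= 7                 -- keeps 'read' well within Int range
  , let n = read ds :: Int
  , n <= ord maxBound              -- must be a valid character code
  = chr n : unescapeName more
unescapeName (c : cs) = c : unescapeName cs
unescapeName []       = []
```

Because literal backslashes are themselves escaped, unescapeName undoes
escapeName for any input string.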

> >>However, XML handles unescaped Unicode (or UTF-8) just fine, as long as 
> >>you declare the appropriate encoding at the beginning, eg. <?xml 
> >>version='1.0' encoding='utf-8'?>.
> >
> >We can't really declare the encoding, since we don't know what the encoding
> >of the user's data is.
> 
> The default encoding in XML is UTF-8. So whether or not you declare it, 
> you must still adhere to a specific encoding.
> 
> For file names, enforcing UTF-8 -- and therefore pretty much outputting 
> them verbatim -- might not be such a bad idea.

The only catch is that we have no idea what the encoding of the file names
is, so doing a conversion can be tricky.

> For actual file data, the best way to do this, I think, is to escape 
> everything above 127 as character references, eg. &#128;. I think you 
> can safely output everything below verbatim. But you can't output all 
> characters as-is because certain combinations can be construed as UTF 
> control sequences even when they aren't.

Well, even worse, if you take non-UTF-8 characters, odds are that they will
cause a failure when interpreted as UTF-8.  I'd rather avoid dealing with
character encodings as far as possible.  There's no reason we should
require (for example) that all files in a repository have the same
encoding.
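
For reference, the character-reference suggestion quoted above could be
sketched roughly as follows.  This assumes each Char holds one byte of file
data (0-255), which amounts to treating the bytes as Latin-1 code points, so
it does not really avoid the encoding question.  The names xmlEscapeChar and
xmlEscape are made up for this example.  The sketch also escapes the markup
characters &, < and >, and it does nothing about control characters below 32
other than tab, LF and CR, which XML 1.0 does not allow even as references.

```haskell
import Data.Char (ord)

-- | Escape one character for use in XML text content.
xmlEscapeChar :: Char -> String
xmlEscapeChar c
  | c == '&'    = "&amp;"
  | c == '<'    = "&lt;"
  | c == '>'    = "&gt;"
  | ord c > 127 = "&#" ++ show (ord c) ++ ";"   -- numeric character reference
  | otherwise   = [c]

-- | Escape a whole string; ASCII other than markup characters is left as-is.
xmlEscape :: String -> String
xmlEscape = concatMap xmlEscapeChar
```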

I suppose we could give commands that output XML an optional argument to
specify the encoding?
-- 
David Roundy
http://www.darcs.net



