[darcs-users] Re: XML format

Mon Dec 20 09:19:38 UTC 2004

Alexander Staubo <alex at byzantine.no> writes:

> Einar Karttunen wrote:
> Why? Darcs can be statically linked with whatever it chooses. 
> Furthermore, I know there are at least two native Haskell XML
> libraries, HaXml and hxml.

Neither an expat nor a sablotron binding exists. Writing the former
should however be quite easy. The Haskell libraries are quite good
and hxml is even quite reasonably sized...

> Why should illegal UTF-8 sequences ever occur?

Because people want to store all kinds of funny things. Forcing all
files with invalid UTF-8 to go through hoops would be quite painful.

>> Wouldn't writing a library interface to Darcs be the best and most
>> clean solution to this all?
>
> Moving Darcs into a library for which the command-line program is just 
> another client is a good idea. But I feel that is orthogonal to what we 
> are discussing, because XML is still more language-neutral than a
> library.

But building whatever XML interface you want is also trivial on top of
the library and no-one can protest as that does not affect the darcs
binary. 

> The Subclipse folks, who are developing an Eclipse plugin for 
> Subversion, have been using the library approach from Java, by writing a 
> JNI interface to Subversion. Subversion's compilation dependency hell 
> means very few people are able to compile the damn thing, and so the net 
> has been rife with people constantly clamouring for new builds when 
> either Subclipse or Subversion is updated. In fact, the Subclipse people 
> recently stopped shipping JNI library builds, because it's too much
> hassle.

This mostly tells us that having our tool depend on tons of libraries
makes it hard for people to compile.

> The web is encoded; the days when everything was ISO-8859-1 and 
> English/Western European are gone, and in fact hardly ever existed 
> except in the minds of English-centric developers.

Let's look at a simple example case.

We have a system that stores mbox files in a darcs repository. Now each
message inside the file has its own encoding, but we still want to treat
the file as "text" for darcs. How would this work in
encoding-sensitive land?
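To make the mbox case concrete, here is a minimal sketch in Python. The two toy messages, the simplified "From " lines and the helper name decodes_as are all made up for illustration. It only shows that a file whose messages use different encodings has no single encoding as a whole, while a byte-wise split into lines still works fine.

```python
# Two toy messages for one mbox-style file, each body in a different encoding.
# The addresses and the simplified "From " lines are illustrative only.
msg_latin1 = "From a@example.org\nSubject: one\n\nna\u00efve caf\u00e9\n".encode("latin-1")
msg_utf8 = "From b@example.org\nSubject: two\n\nna\u00efve caf\u00e9\n".encode("utf-8")
mbox = msg_latin1 + b"\n" + msg_utf8


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes cleanly under `encoding` (hypothetical helper)."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Each message is fine in its own encoding, but the file as a whole is not UTF-8.
whole_file_is_utf8 = decodes_as(mbox, "utf-8")

# Treating the file as bytes split on newlines needs no encoding at all.
lines = mbox.split(b"\n")
```

Decoding the whole file as Latin-1 would not raise an error either, but it would silently misread the second message, so a per-file encoding label cannot be right for both messages.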

Also, I want to use batch jobs that add and record files to a
repository. Now the batch job cannot generally know what encoding a
single file is supposed to have...

And if we want encodings, then I would argue for MIME types too,
on the same grounds.

> I assume that by conversions losing precision you mean transcoding into 
> a character subset and somehow losing information. Why would you do that 
> within Darcs? Why are surrogates a problem?

Hmm, surrogates seem to work, but overlong forms in UTF-8 could be quite
nasty. This was fixed in version 3.1 of the Unicode spec, but not all
implementations conform to it yet.

There is also the question of whether one wants to support characters
larger than 0x10ffff. Both decisions can be argued for, and each causes
problems for people wanting the other. Also, "The UCS code values
0xd800-0xdfff (UTF-16 surrogates) as well as 0xfffe and 0xffff (UCS
non-characters) should not appear in conforming UTF-8 streams." - is
this lore or a fact?
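As one data point, and not a reading of the spec, here is a small sketch of how a single strict decoder treats these byte sequences. It uses Python 3's built-in UTF-8 codec; the helper name is_valid_utf8 and the sample byte strings are chosen for illustration, and other implementations may well behave differently.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts `data`."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Sample sequences (picked for illustration):
plain_slash = b"/"                    # U+002F in its shortest form
overlong_slash = b"\xc0\xaf"          # overlong two-byte form of U+002F
surrogate_d800 = b"\xed\xa0\x80"      # UTF-8-style encoding of surrogate U+D800
noncharacter_fffe = b"\xef\xbf\xbe"   # the non-character U+FFFE
beyond_10ffff = b"\xf4\x90\x80\x80"   # would decode to 0x110000

results = {
    "plain": is_valid_utf8(plain_slash),
    "overlong": is_valid_utf8(overlong_slash),
    "surrogate": is_valid_utf8(surrogate_d800),
    "noncharacter": is_valid_utf8(noncharacter_fffe),
    "beyond_10ffff": is_valid_utf8(beyond_10ffff),
}
```

With this particular decoder the overlong form, the surrogate and the value above 0x10ffff are rejected, while the non-character 0xfffe is accepted. So at least one implementation does not treat all the cases in the quoted sentence alike.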

If you don't want darcs to actually use the encoding for anything,
couldn't you just put a file named "encodings" in your repo that
contains filename:encoding pairs and live happily with that?
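A rough sketch of what reading such a file could look like, in Python. Only the one-pair-per-line "filename:encoding" idea comes from the suggestion above. The function names, the "#" comment lines, and the choice to split on the last colon are assumptions made up for this example; darcs itself knows nothing about this file.

```python
def parse_encodings(text: str) -> dict:
    """Parse hypothetical 'filename:encoding' lines into a dict."""
    table = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: blank lines and lines starting with '#' are ignored.
        if not line or line.startswith("#"):
            continue
        # Split on the LAST colon so filenames containing ':' survive.
        name, sep, encoding = line.rpartition(":")
        if not sep:
            continue  # no colon at all: skip the malformed line
        table[name.strip()] = encoding.strip()
    return table


def encoding_for(table: dict, filename: str, default=None):
    """Look up a file's declared encoding; unknown files get `default`."""
    return table.get(filename, default)


sample = """\
# filename:encoding
README:utf-8
docs/manual.tex:iso-8859-1
odd:name.txt:koi8-r
"""
table = parse_encodings(sample)
```

Files that are not listed simply have no declared encoding, which fits the mbox and batch-job cases where nobody knows the encoding anyway.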

- Einar Karttunen