[darcs-users] Re: XML format

Thomas Zander zander at kde.org
Mon Dec 20 16:38:13 UTC 2004


Einar,
I have the feeling we are not all on the same page here; you are 
thinking about encodings on a whole different level than Alexander and I 
are. I'll try to clear some things up below:

On Monday 20 December 2004 10:19, Einar Karttunen wrote:
> Alexander Staubo <alex at byzantine.no> writes:
> > Why should illegal UTF-8 sequences ever occur?
>
> Because people want to store all kinds of funny things. Forcing all
> files with non-valid utf8 go through hoops would be quite painful.

First, the idea is that all text is re-coded to UTF-8 for the XML. If the 
file you are managing is not UTF-8, then it will be converted by darcs.
File encodings outside of darcs can always fall back to the 
lowest-common-denominator encoding: ASCII[1]. So, if you have a file that 
does not seem to be UTF-8, darcs can interpret it as ASCII and then 
convert it to UTF-8 in its XML. This is by definition lossless. Reading 
from the darcs XML and converting back to ASCII is the exact opposite of 
the first step, and again by definition lossless.
It's just about the same thing darcs currently does with binary files, and 
absolutely not a 'pain' to do consistently.
Please note that this encoding handling is done by all XML libraries out 
of the box, so darcs does not even have to think about it (which is 
another good reason to use a well-debugged library, should darcs do XML).
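The round trip described above can be sketched in a few lines of Python. This is a hypothetical illustration, not darcs code; I use Latin-1 as the byte-per-character fallback, since a strict 7-bit ASCII codec would reject bytes above 127:

```python
# Arbitrary bytes that are not valid UTF-8 (think: a "binary" file).
raw = bytes(range(256))

# Step 1: fall back to a byte-per-character interpretation.
# Latin-1 maps every byte 0..255 to a code point, so nothing is lost.
text = raw.decode("latin-1")

# Step 2: re-code the text as UTF-8 for storage in the XML.
stored = text.encode("utf-8")

# Reading back: decode the XML's UTF-8, then reverse the fallback.
roundtrip = stored.decode("utf-8").encode("latin-1")

assert roundtrip == raw  # lossless by construction
```

The stored UTF-8 bytes differ from the original file, but the decode/encode pair is an exact inverse, which is the whole point of the scheme.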

> > The Subclipse folks, who are developing an Eclispe plugin for
> > Subversion, have been using the library approach from Java, by writing
> > a JNI interface to Subversion. Subversion's compilation dependency hell
> > means very few people are able to compile the damn thing, and so the
> > net has been rife with people constantly clamouring for new builds when
> > either Subclipse or Subversion is updated. In fact, the Subclipse
> > people recently stopped shipping JNI library builds, because it's too
> > much hassle.
>
> This tells us mostly that having our tool depend on tons of libraries
> makes it hard to compile for people.

I'll try to explain the JNI process, since the troubles have nothing to do 
with libraries or dependencies.
JNI requires two parts: a Java part and a C-library part. Changes in the 
Subversion library (which is another C library) have to be followed 
manually in the Java object-oriented library.
Then the compiler has to create a bunch of header files, and those new 
header files have to be linked with the Subversion library, which is then 
used by the JVM.
It's the order of compiling and the removal of stale files that's hard 
here, next to the manual labour of changing the Java files.
I think you will agree that using a library whose API will not change is a 
totally different thing, as installing it and typing 'make' is all you 
have to do.
IMO the fact that you need another library installed before you can 
compile darcs is not much of a problem at all, as long as it's something 
widely known and mature.

> > The web is encoded; the days when everything was ISO-8859-1 and
> > English/Western European are gone, and in fact hardly ever existed
> > except in the minds of English-centric developers.
>
> Lets look at a simple example case.
>
> We have a system that stores mbox files in a darcs repository. Now each
> message inside the file has its own encoding, but we still want to treat
> the file as "text" for darcs. How would this work in the encoding
> sensitive land?

Not a very good example, as the mbox format itself forces an encoding, 
which again makes it one encoding per file.

Besides, I already mentioned that if all else fails the ASCII encoding 
should be used, which means darcs can manage it without any problems.

> And if we want encodings then I would argue for mime-types too
> on the same grounds..

So, since you are so opposed to encodings, you don't want mime types? We 
surely agree on that one ;-)

> If you don't want darcs to actually use the encoding for anything
> couldn't you just put a file named "encodings" in your repo that
> contains filename:encoding pairs and live happily with that?

You don't have to use encodings at all; binary in, binary out already 
works, and the conversions done are irrelevant to you, the user. So if 
you don't want to do conversion in your repo, use this solution.
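A minimal sketch of what such an "encodings" file could look like, and how a tool might read it. The filename:encoding format is Einar's suggestion; the file layout and the parsing code here are purely hypothetical:

```python
def parse_encodings(listing):
    """Parse 'filename:encoding' pairs, one per line; '#' starts a comment."""
    table = {}
    for line in listing.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last ':' so filenames may themselves contain colons.
        name, _, enc = line.rpartition(":")
        table[name] = enc.strip()
    return table

sample = """\
# per-file encodings, maintained by hand
README:utf-8
docs/notes.txt:iso-8859-1
"""
table = parse_encodings(sample)
assert table["docs/notes.txt"] == "iso-8859-1"
```

Since darcs would never look at this file itself, the format is entirely up to the repository's users, which is exactly the appeal of the suggestion.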

I rather like being able to have things like a UTF-8 filesystem, and to 
convert old repositories to UTF-8 without sacrificing the ability to use 
patches from people who still use the old encoding.


1) In all fairness: ASCII is a charset, while UTF-8 is a file encoding, 
two different steps in converting text to a file. ASCII here assumes an 
8-bits-per-character file encoding (strictly, ASCII itself is 7-bit), 
while UTF-8 assumes Unicode 4. So when I say ASCII, I mean that the file 
on disc can be binary and still be interpreted as valid (but maybe not 
correct) text.
-- 
Thomas Zander