[darcs-users] Colin Walters blogs on Arch changesets vs Darcs

Colin Walters walters at verbum.org
Sat Nov 20 02:53:15 UTC 2004


[ A friend pointed me to the response on this list, so I just
subscribed ]

On Fri, 2004-11-19 at 19:16 -0500, Andrew Pimlott wrote:

> On Thu, 18 Nov 2004, Colin Walters wrote:
> > I mentioned in my previous blog entry about revision control that I
> > thought that the Arch model of changesets which are independent of
> > project history is crucial. But why is that?
> [snip]
> > And as I mentioned before, an Arch changeset is basically just a
> > super-patch that handles binary files and renames. If projects include
> > just a bit of constant-sized metadata in their tarballs, the logical
> > file identity, you can run tla mkpatch old-tree new-tree to generate a
> > changeset between those two trees. You do not need access to the Arch
> > repository.
> 
> This could be done with darcs:  Publish the "darcs changes --context" in
> the tarball,

From a look at the manual, it looks to me like that would involve
shipping your entire history around in every single tarball you release.
Do you truly think that is a practical solution?

> It admittedly goes somewhat against the grain of darcs.  Furthermore, it
> assumes the user has darcs installed (just as the quoted example assumes
> the user has tla installed), 

Well, as I mentioned, the changeset functionality could fairly easily be
broken out into a separate program.  I think that would be a useful
thing to do.  For example, if the Linux kernel added arch-tag headers to
their files, then you could use 'tla mkpatch' to create a changeset to
send to them, *even though they still use Bitkeeper*.  They could still
apply that changeset, and to Bitkeeper, it wouldn't be any different
than them applying a regular GNU patch, and manually adding in e.g. a
firmware binary file that would have to be a separate attachment before
changesets.

> which raises the question of why he didn't
> use darcs to get the code in the first place.  

Because he didn't necessarily expect to be hacking on the code; he just
downloaded the latest tarball release of Conglomerate from their
website, tried to build it, and discovered that it was missing e.g. an
#include.

> (And you can do a darcs
> get --partial to avoid downloading the entire history.)  

Sure; but as I understand the theory of patches, the lack of a logical
file identity concept means that if a number of renames have occurred
upstream since the .tar.gz release, when they receive this Darcs patch,
it could happen to apply to a totally different file that was moved into
the same name (a good example would be Makefile.am).  Either that, or
the patch completely fails to apply.
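
To make the concern concrete, here is a minimal sketch of path-based versus identity-based lookup. It is a toy model only: the tree layout, the ids, and both helper functions are hypothetical and are not taken from Arch or Darcs. It illustrates the failure mode as described above, not how either tool is actually implemented.

```python
# Hypothetical model: a tree maps path -> (file_id, contents).
# The ids and helpers are illustrative assumptions, not a real tool's API.

def find_by_path(tree, path):
    """Path-based lookup: whatever currently lives at `path`."""
    return path if path in tree else None

def find_by_id(tree, file_id):
    """Identity-based lookup: follow the file wherever it was renamed."""
    for path, (fid, _contents) in tree.items():
        if fid == file_id:
            return path
    return None

# Tree as shipped in the tarball release.
release = {
    "Makefile.am": ("id-top-makefile", "SUBDIRS = src\n"),
    "src/Makefile.am": ("id-src-makefile", "bin_PROGRAMS = app\n"),
}

# Upstream since the release: the old top-level Makefile.am moved to
# build/, and src/Makefile.am was moved into the top-level name.
upstream = {
    "build/Makefile.am": ("id-top-makefile", "SUBDIRS = src\n"),
    "Makefile.am": ("id-src-makefile", "bin_PROGRAMS = app\n"),
}

# A patch made against the release's top-level Makefile.am.
patch_target_path = "Makefile.am"
patch_target_id = release[patch_target_path][0]

by_path = find_by_path(upstream, patch_target_path)  # lands on a different file
by_id = find_by_id(upstream, patch_target_id)        # follows the rename
```

Here the path-based lookup finds a file named Makefile.am, but it is not the file the patch was written against; the identity-based lookup finds the original file at its new location.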

> Nevertheless, I
> think we can have this feature if people want it.  I don't see it as a
> reason for preferring arch's less strict notion of patches.

My general thesis here is that you really want both inexact patching and
logical file identity.

> With darcs, the XEmacs people would say
> 
>     darcs pull --patch 'some patch to ibuffer.el' http://darcs.gnu.org/...

Right; I never claimed that this is theoretically impossible in Darcs,
merely that it is impractical.

> And the darcs patch would probably be more likely to merge correctly (if
> there are no conflicts) 

Note that Arch changesets, being just regular GNU patches, can be
applied using the normal patch .rej mechanism in the case of conflict.
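
As a rough illustration of what "inexact patching with rejects" means, here is a simplified sketch. It assumes a hunk is just a pair of (old lines, new lines) and matches on exact content anywhere in the file; GNU patch itself works from line numbers, offsets and fuzz, so this is not its algorithm. The function name and data are hypothetical.

```python
def apply_hunks(lines, hunks):
    """Apply each (old, new) hunk where `old` occurs; collect the rest as rejects."""
    result = list(lines)
    rejects = []
    for old, new in hunks:
        n = len(old)
        for i in range(len(result) - n + 1):
            if result[i:i + n] == old:
                result[i:i + n] = new  # first match wins
                break
        else:
            # No place to apply this hunk: set it aside, as a .rej file would.
            rejects.append((old, new))
    return result, rejects

source = ["#include <stdio.h>", "int main(void) {", "    return 0;", "}"]
hunks = [
    # Applies cleanly: adds a missing #include.
    (["#include <stdio.h>"], ["#include <stdio.h>", "#include <stdlib.h>"]),
    # Does not apply: the line it expects is not in the file.
    (["    return 1;"], ["    return EXIT_FAILURE;"]),
]
patched, rejected = apply_hunks(source, hunks)
```

The point is that one failing hunk does not block the others; it is left over for a human to resolve by hand.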

> > In Darcs, this would, as far as I can tell, not work. The reason is
> > that in order to correctly merge these individual changesets, you
> > would require access to the entire history at once (in memory, no
> > less!). That's because Darcs needs that history in order to correctly
> > reorder patches and infer renames.
> 
> If I understand correctly, your basic point is correct, though it is not
> quite that bad.  With the above command, darcs would have to download
> all of the patches (or at least all of them up to "some patch") not in
> the current XEmacs repo, ie all since the branch.  

That's what I thought; you need all 14 years of history since the fork.

Now, this leads me to my next question; I've seen references to Darcs
keeping history up to a "horizon".  In Arch, each revision does
carry with it a single file (the patch log) per changeset merged.  So
without any intervention, your working tree grows proportionally to
history size.  Obviously, that's bad.  However in Arch, you can "prune"
the patch logs.  I did this for Rhythmbox not too long ago with a script
that looked for changesets against ancient versions, and "dead"
branches.  However - this only affects further revisions.  Your history
is still transparently saved because the archive, where changesets are
stored, is separate from the working tree.  If you wanted, you could go
back and examine that "early" history.

So in Darcs, I would assume that the "horizon" means that patches before
that are simply dropped?  I don't see that Darcs has any formal means of
keeping around pre-horizon history, although certainly I can imagine
just copying the tree.  But would you expect that the Emacs and XEmacs
people would have their horizons going back 14+ years?  *Every* "darcs
get" containing all 14 years of history, plus some amount before the
fork?  Every branch duplicating that history?

> And it would indeed
> suck lots of memory commuting them (again, just the patches since the
> branch, but on both forks), however this again is a performance detail.
> (I say this jokingly because I do believe that the performance issues
> can and will be solved, and they don't bite me hard; but if you need
> performance now or can't take the risk that they never get solved, darcs
> loses.)

I think that characterizing this as simply "performance" is a bit
disingenuous.  We're really talking about what the fundamental
architecture of a distributed free software revision control system
should be.  We really have to get this right the first time; finding out
14 years later that the RCS we chose "can't do that" isn't really an
option.

> What I think this really calls for is a smart darcs server that can do
> all the necessary commuting on darcs.gnu.org, and send just the
> appropriate representation of "some patch". 

I'm surprised you didn't call me on this, but I realized after I posted
that there's an obvious question - how do the XEmacs people determine
which revisions are applicable to ibuffer.el in the first place, without
traversing the Emacs history anyways?  I think the answer to that would
have to be some sort of smart server, so Arch doesn't have a magic
bullet for the initial problem either.

But for Arch, determining revisions applicable to a file is a problem
that is bounded by the history *relating to that file*, whereas in
darcs, it's bounded by the total size of history (in order to do renames
correctly, right?).  In Arch, you just do:

# Collect the name of every changeset, in any archive, that touches filename.
modified = []
for archive in archives:
    for changeset in changesets(archive):
        if changeset.modifies(filename):
            modified.append(changeset.name)
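
A runnable version of the loop above, as a sketch under stated assumptions: the Changeset class, the archive names, the patch names and the in-memory layout are all hypothetical stand-ins, not Arch's actual data model. It also matches files by path for brevity, whereas the argument here relies on logical file identity.

```python
from dataclasses import dataclass, field

@dataclass
class Changeset:
    """Hypothetical changeset: a name plus the set of files it touches."""
    name: str
    files: set = field(default_factory=set)

    def modifies(self, filename):
        return filename in self.files

# Toy archives; real ones would be read from disk or over the network.
archives = {
    "emacs": [
        Changeset("patch-1", {"lisp/ibuffer.el"}),
        Changeset("patch-2", {"src/xdisp.c"}),
    ],
    "xemacs": [
        Changeset("patch-7", {"lisp/ibuffer.el", "lisp/files.el"}),
    ],
}

def changesets(archive):
    """Return the changesets stored in the named archive."""
    return archives[archive]

def modified_by(filename):
    """Names of all changesets, across all archives, that touch filename."""
    modified = []
    for archive in archives:
        for changeset in changesets(archive):
            if changeset.modifies(filename):
                modified.append(changeset.name)
    return modified

result = modified_by("lisp/ibuffer.el")
```

Each changeset is inspected on its own, so nothing in this scan needs the whole history held in memory at once.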

Maybe you can just say that darcs.gnu.org has to be a big machine with
64GB of RAM or whatever, but that doesn't seem like a solution to me.

>  I think this will happen
> some day.  Until then, I think you're right that darcs loses in this
> case.  (I'm not convinced this should be a show-stopper, since if you're
> going to be doing a lot of merging, downloading the gnu.org repo is a
> modest price.)

Yes.  Honestly, I would expect that in a big merge-fest scenario like
this, the XEmacs people would want a local cache of the Emacs history,
and vice versa.  However, it seems to me that in Darcs, not only do you
need to pull that entire history into memory at once - you also need to
pull it all back from the "horizon".  So it's not just the person doing
the merge that pays the price - every single person checking out
Emacs/XEmacs from a Darcs repository now has to suck 14+ years of
history along with it.  I just don't see this as a fundamentally
scalable solution, no matter what performance tricks one pulls.
