[darcs-users] Coalescing patches

Stephen J. Turnbull stephen at xemacs.org
Thu Sep 24 14:29:31 UTC 2009

Nik writes:

 > > I wonder if a repo (actually, darcs changes and maybe darcs whatsnew)
 > > should have a notion of "my patches" and "your patches", or even a
 > > more fine-grained (word for the day!) hierarchical structure
 > > reflecting the developers' organization.  But at this point that's a
 > > new thread, I guess.
 > Interesting ... when you say "developers' organization", do you mean 
 > something like an org-chart of the developers, or more the way a single 
 > developer organises their repo?

I meant the org-chart.  I write the VFS, you write the concrete FS, we
want to see each other's patches in real time and in detail.  The
video system changes can come in a batch once a month or so, I don't
care as long as the system builds (and I can just revert the bad merge/
coalesced patch for that).

 > I'm not a big fan of switching branches.
 > I do, however, have a separate suggestion for making spontaneous 
 > branches a little more corporeal, by storing their names somewhere in 
 > the repo or its metadata,

Hey, you've just invented git! ;-)  Seriously, computationally a git
branch is nothing more nor less than a *name* for the head of a singly
linked list which may share some tail with other branches.  (Not quite
true, because of merges, but pretty close.)
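
To make that concrete, here is a minimal Python sketch.  All names are
made up for illustration, and merges are ignored, so each commit has
exactly one parent.  A "branch" is just an entry in a dict that points
at the head of a list:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Commit:
    """One node of the history list (hypothetical, single-parent only)."""
    message: str
    parent: Optional["Commit"] = None


def history(head: Optional[Commit]) -> List[str]:
    """Walk from a branch head back to the root, newest first."""
    messages = []
    while head is not None:
        messages.append(head.message)
        head = head.parent
    return messages


# Two branches that share a tail: both heads point at the same parent.
root = Commit("initial import")
shared = Commit("add VFS layer", root)
branches: Dict[str, Commit] = {
    "master": Commit("fix typo in comment", shared),
    "feature": Commit("wrap block in conditional", shared),
}

for name, head in branches.items():
    print(name, history(head))
```

Creating or deleting a branch in this model only touches the dict; the
commits themselves are never copied.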

 > A feature that could help here might be an ability within darcs to 
 > separate formatting changes from content changes, and save them in 
 > separate patches, even in response to a single record operation.
 > The developer can still put those separate patches into the same 
 > spontaneous branch if they really belong together.
 > It seems to me that would help a lot - your thoughts?

Sure, but it would be hard, I think.  I suspect that if you went to
the trouble of producing such a thing, you'd have a VCS that operates
not on plain text but on structured languages.  That part would get
hijacked and people would use WYSIWYG editors rather than editing a
text representation of the structure.

 > > checking style, initiating a build and test cycle, and reviewing
 > > results that take up the bulk of the cherrypicking time.
 > Hmmm, sounds like some or all of this could be pushed off to a CI tool?

Not really.  The checking and reviewing have to be done by a human; as
you say, the build/test cycle can run off a commit hook.

 > The feature as I envisaged it would be able to do exactly that:
 > * push --unify from my-repo to stable
 > * push --unify from my-repo to dev, or push unified patch from stable to 
 > dev.

Sure, but the tedious part is doing it patch by patch; not all patches
go to both branches.

 > In what context is this difference or similarity being detected?

Suppose you have some block of existing code, with a typo in a
comment.  Commit 1 wraps the block in a conditional, and indents it.
This still needs testing and should stay in the feature branch.
Commit 2 fixes the typo in the comment, and wants to be pushed.

Cherrypicking commit 2 will result in a conflict, based on whitespace
alone.  If commit 1 actually changed the line with the typo, but left
the typo behind, then you're really behind the 8-ball.
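
A toy version of the first case, in Python.  This is a naive
line-based hunk model invented for illustration; it is not how Darcs
or git actually represent or merge changes, but it shows why the
whitespace alone is enough to cause trouble:

```python
def apply_hunk(lines, start, old, new):
    """Replace `old` with `new` at `start`; fail if the context differs."""
    if lines[start:start + len(old)] != old:
        raise ValueError("conflict: hunk context does not match")
    return lines[:start] + new + lines[start + len(old):]


base = [
    "/* compute teh total */",
    "total = a + b;",
]

# Commit 1: wrap the block in a conditional and indent it.
after_commit1 = apply_hunk(base, 0, base, [
    "if (enabled) {",
    "    /* compute teh total */",
    "    total = a + b;",
    "}",
])

# Commit 2: fix the typo, recorded on top of commit 1 (indented context).
commit2 = (1, ["    /* compute teh total */"], ["    /* compute the total */"])
after_commit2 = apply_hunk(after_commit1, *commit2)

# Cherrypicking commit 2 without commit 1: the context is unindented here.
try:
    apply_hunk(base, *commit2)
    conflict = False
except ValueError:
    conflict = True

print(conflict)
```

The typo fix applies cleanly in the feature branch and fails against
the original block, although the only difference in the affected line
is its indentation.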

 > Just to be clear: As I understand it, git is really two separate things 
 > which now have acquired the same name (confusingly):
 > 1. a high-performance object-storage system;
 > 2. a DCVS built on that high-performance storage system.

The object storage system is actually not that high performance,
especially on Windows.  (Sure, it's pretty well optimized, but what do
you expect from people whose day job is to write file systems?)  The
interesting thing about it is that to the DVCS it appears as just a
big hashtable.  What made git appear so fast in the early days was the
cunning choice of primitive objects: a space efficient, cacheable
representation of trees allowing recursive diffs and similar
operations to be done extremely quickly, and separation of the history
DAG from the content (except for the single reference to a tree).
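
Here is roughly what "just a big hashtable" means, as a Python sketch.
The class and its methods are hypothetical.  The "<type> <size>\0"
header in front of the payload follows git's framing, but the tree and
commit payloads below are simplified text, not git's real formats:

```python
import hashlib


class ObjectStore:
    """Content-addressed store: the key is a hash of the object itself."""

    def __init__(self):
        self._objects = {}

    def put(self, kind: str, payload: bytes) -> str:
        header = f"{kind} {len(payload)}".encode() + b"\0"
        key = hashlib.sha1(header + payload).hexdigest()
        self._objects[key] = (kind, payload)  # identical content, same key
        return key

    def get(self, key: str):
        return self._objects[key]


store = ObjectStore()
blob = store.put("blob", b"hello\n")
# Simplified tree: one "name key" line per entry.
tree = store.put("tree", f"README {blob}\n".encode())
# The commit holds history metadata plus a single reference to a tree.
commit = store.put("commit", f"tree {tree}\n\ninitial import\n".encode())

print(store.get(commit))
```

Storing the same content twice yields the same key, so unchanged files
and unchanged subtrees are shared between commits for free.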

 > I was proposing a combination of darcs' patch theory as the DCVS, using 
 > only the high-performance object-storage part of git. So there would be 
 > no git history DAG, no git index, etc. Just darcs patches stored in a 
 > high-speed object storage system.

Uhh ... maybe.  I can't deny that it might make a difference, but I
think the key to git's famous speed is not the object system itself,
but simply the extremely concise representation of large objects (such
as a kernel tree) combined with the fact that the representation is
very good at localizing the kinds of changes that are typically being
dealt with.

To give you an example, consider a checkout.  To check out a new tree,
git needs to (1) find and dereference a commit object (a couple
hundred bytes, mostly log message); (2) find and dereference a tree
object; (3) find and dereference N blobs, and do the I/O to store them
to the new tree.  Most checkouts will be of relatively recent commits,
so many of the blobs, the commit, and the tree will be in loose
objects.  The rest of the blobs are likely to be contiguous, rather
than delta compressed.  Result: the checkout occurs with nearly 100%
IO efficiency for the tree contents, and very low overhead for the
structural objects (< 1%).  To do the same, Darcs needs to repeatedly
apply patches, and those patches may (probably will, in fact) often
change some of the same lines over and over again.  If you've done a
checkpoint recently, you're in pretty good shape, but if not, this is
quite inefficient.
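
The contrast can be sketched in Python.  Both models are toys with
invented names and data: the "patches" are plain line-replacement
hunks, not Darcs's real patch format, and the snapshot store is a flat
dict rather than git's object database:

```python
# Snapshot model: commit -> tree -> blobs, each a single lookup.
objects = {
    "c1": {"tree": "t1", "message": "initial import"},
    "t1": {"README": "b1", "main.c": "b2"},
    "b1": "hello\n",
    "b2": "int main(void) { return 0; }\n",
}


def checkout_snapshot(objects, commit_id):
    tree = objects[objects[commit_id]["tree"]]
    return {path: objects[blob_id] for path, blob_id in tree.items()}


# Patch-chain model: replay every hunk in order, starting from nothing.
patches = [
    ("README", 0, [], ["helo\n"]),
    ("README", 0, ["helo\n"], ["hello\n"]),  # same line touched again
    ("main.c", 0, [], ["int main(void) { return 1; }\n"]),
    ("main.c", 0, ["int main(void) { return 1; }\n"],
     ["int main(void) { return 0; }\n"]),
]


def checkout_by_replay(patches):
    files = {}
    for path, start, old, new in patches:
        lines = files.setdefault(path, [])
        if lines[start:start + len(old)] != old:
            raise ValueError(f"hunk does not apply to {path}")
        lines[start:start + len(old)] = new
    return {path: "".join(lines) for path, lines in files.items()}


print(checkout_snapshot(objects, "c1") == checkout_by_replay(patches))
```

Both produce the same working tree, but the snapshot version does work
proportional to the size of the tree, while the replay version does
work proportional to the length of the history since the last
checkpoint.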

It's worse for things like diffs.  Git derefs two commits, two trees,
and then compares the tree objects.  Only the changed objects,
typically one or two to a few dozen, out of perhaps thousands of
files, need be compared.  For Darcs, though, you'll need to look at
all the diffs between the commits.  I'm not saying it can't be done
efficiently (or that it isn't being done efficiently), just that the
git strategy for efficient operation is almost a no-brainer given the
data structure, while for Darcs's chain-of-patches representation it's
a harder conceptual problem, and I don't think that using the git
storage engine will make that much difference to Darcs performance.
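
As a sketch of why the tree comparison is cheap (Python, with a
made-up store layout in which each tree maps a name to a (kind, id)
pair): two entries with the same id are identical by construction, so
the whole subtree below them can be skipped without being loaded.

```python
def diff_trees(store, old_id, new_id, prefix=""):
    """Return paths that differ between two trees, pruning equal ids."""
    if old_id == new_id:
        return []
    old, new = store[old_id], store[new_id]
    changed = []
    for name in sorted(set(old) | set(new)):
        o, n = old.get(name), new.get(name)
        if o == n:
            continue  # same kind and id: nothing below here changed
        path = prefix + name
        both_trees = (o is not None and n is not None
                      and o[0] == "tree" and n[0] == "tree")
        if both_trees:
            changed.extend(diff_trees(store, o[1], n[1], path + "/"))
        else:
            changed.append(path)  # added, removed, or modified entry
    return changed


store = {
    "T_old": {"README": ("blob", "r1"), "docs": ("tree", "D1"),
              "src": ("tree", "S1")},
    "T_new": {"README": ("blob", "r1"), "docs": ("tree", "D1"),
              "src": ("tree", "S2")},
    "S1": {"fs.c": ("blob", "f1"), "vfs.c": ("blob", "v1")},
    "S2": {"fs.c": ("blob", "f1"), "vfs.c": ("blob", "v2")},
    # "D1" is deliberately absent: the diff never needs to load it.
}

print(diff_trees(store, "T_old", "T_new"))
```

Only the one changed blob is reported, and the unchanged "docs"
subtree is never opened, however many files it contains.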
