[darcs-users] darcs conflicts/dependencies -- is patch theory the place to start?

Thu Sep 13 03:49:53 UTC 2012

AntC writes:
 > AntC <anthony_clayden <at> clear.net.nz> writes:
 > 
 > > 
 > > > Kevin Quick <quick <at> sparq.org> writes:
 > > > 
 > > > On a more abstract level, Patch Theory ...  
 > > > ... does not require, nor implement a DAG of patches, and in any
 > > > particular repository, all patches are active ...
 >  
 > Hmm, that claim about the unimportance of DAGs seems rather hollow:

It didn't say that DAGs don't exist, it says that Patch Theory (as
conceived of by David Roundy et al, and therefore Darcs) doesn't have
the concept.

 > - All the complexity of commutation is exactly to do with sequencing.
 > - I've now learnt there's extra complexity to do with (attempting to) pull
 >   duplicate patches. (And how duplicates interact with conflicts.)
 >   If a repo were really just a set of patches,
 >      managing duplicates would be easy(?).

Unfortunately, I think not.  The problem is that unless you know the
semantics of the document, you can't know what is a duplicate (ie,
applying it twice results in invalid semantics) and what is a valid
second patch of the same appearance.  For example, consider token
replace: replace 'foo' with 'bar', merge a branch containing 'foo',
replace 'foo' with 'bar'.  That has obvious and probably correct
semantics; the second replace is not a duplicate.

You could argue that a patch should include preconditions.  The
simplest approach is to say "this patch applies exactly to this
revision".  Ie, the strongest possible precondition leads to the DAG
of revisions approach implemented by monotone, git, and similar
DVCSes.

In the theory, you may want to distinguish between *repo* (at its
simplest, a [multi]set of patches) and *branch* (repo + semantics for
constructing a working tree).  Of course you can simplify the
representation of a repo, but this comes at the cost of complexifying
the construction of working trees.  Or you can adopt *branch* (as I
just defined it) as your primitive concept, and say that a repo is a
branch.  Either way, the complexity remains.

 > - Possibly there's accidental complexity [per Fred Brooks]
 >     in having to commute patches,

The fact that commutation results in exponential complexity strongly
suggests this is essential complexity. ;-)  Seriously, your point
about preconditions (you called it "sequencing") immediately leads to
essential complexity.

 >     and perhaps there's accidental conflicts?
 >     (I've seen comments that for some conflicts
 >      the resolution seems 'obvious'.)

The examples I've seen require you to know at least the syntax, and
often the semantics, of the document's language.  However, all the
patch theory I've seen assume that the textual represention *is* the
semantics, so that two sequences of patches that result in the same
text are equivalent.  The problem is identifying weakest
preconditions.

 > > > You might also be interested in the following paper:  
 > (Loh/Swierstra/Leijen Principled Approach -- henceforth L/S/L)
 > > 
 > > The problem with that approach is that a practical VCS only gets
 > > to see the repo intermittenly (at record points).

Note that we should s/practical/standalone/ here.  In an editor like
Emacs it would be trivial to record all operations at the character
level (Emacs already records all operations).

 > > A lot can have changed (especially w.r.t. contents of files). How
 > > does the VCS know which files and which lines got copied/changed
 > > vs. deleted/inserted? How can it keep track of all the internal
 > > file id's and line id's?
 > 
 > This bites darcs with file moves (renames):
 >    you can't detect a move vs. remove-then-add-then-paste-content.

(I still don't understand why anyone cares to detect this difference.
The two are observationally equivalent in a single Darcs patch.  But
that's just me.)

 > In darcs:
 > - To think of a patch is the *same* even though it morphs through
 >   commutation, can we be more precise on what about the patch is
 >   the same?  (apart from its id)?  Something about the sameness of
 >   effect from a patch?

That's exactly how it's defined, as the same effect.  However, effect
is difficult to define explicitly, it's like the judge's definition of
pornography: "I know it when I see it."

 > So, how to look at primitive patches?
 > (as identified by a ppid, which might be a GUID):
 > - Aim to represent the effect of the patch in a context-independent format.

This is hard.  For example, the classic example is

- enum foo { foo1=0, foo2, ... fook };
+ enum foo { foo1=0, foo2, ... fook, fool };

- #define FOO_MAX K
+ #define FOO_MAX L

The *programmer's* intended effect is not to replace K by L; it's to
replace K by K+1, or equivalently by #(enum foo) after applying the
first hunk.  Of course the practical problem here is if one programmer
adds fool and another adds foom, the merger should contain both fool
and foom.  However the primitive patches for FOO_MAX in the two
branches will have L = M = K+1, and the VCS can't know from the
textual effect only that actually one or the other of L and M needs to
be K+2.  A typo correction, for example, is going to be a duplicate.
Worse, consider trying to commute those patches knowing only the text.

N.B. Although it's easy to avoid this particular case by making
FOO_MAX be the (dummy) last item in the enum, such references occur in
many other contexts such as Makefiles.  In practice, this kind of
semantics seems to be quite rare, and of course conflicts are even
more so.  Still, the theory should be able to handle this.

 > Pre-conditions will typically make reference to other ppid's.

So you have a preorder on ppids.  AFAICS, this is equivalent to Darcs.

 > There is no sequencing of patches as such, and therefore no
 > commuting.

You mean there is no rewriting of patches when commuting.  However,
patches by definition must be applied in some sequence (ie, the
preorder is "completed" to a total order, which is generally not
unique), and two sequences with the same effect are commutative.

 >              In general, if I'm pulling a bunch of patches,
 >              they should stay in the same sequence as in the source repo.

This is not at all obvious.  For example, it might make sense to use
some form of common-content compression in transmission, and depending
on the representation, on decompression the patches might be recreated
in a different order.  Or consider Darcs itself, where patches are
ordered arbitrarily by their names.  Finally, one might deliberately
randomize the order to ensure that order-dependency bugs are
discovered quickly.

 > So let's apply the same thinking to the content of a text file:
 > - all content of a text file can be described as a concatenation of hunks.
 > - each line can be tracked to the hunk operation ppid that inserted it
 >   (possibly overlaid with token replace ppid's).
 > 
 > Here's the novel bit: unlike L/S/L's idea of a line id, we'll do the same 
 > ppid 'trick' with hunks:
 > - Instead of 'locating' a hunk operation by its target file and line number,
 > - Locate it by target _hunk_ and line number
 >   -- that is, offset within the text inserted by the hunk.
 > - (And of course identify the target hunk by the ppid where it started life.)

I don't see where you're going with this.  This is a nice contribution
to the L/S/L theory in that it reduces the number of low-level (I'm not
comfortable calling "line id" a "primitive") concepts.

However, the Darcs operation of merging *repos* is already
conceptually trivial: the union of two sets of patches.  The *branch
representation* is then optimized by attempting to find an internally
consistent application sequence for the union repo.  If this doesn't
exist, you have a conflict, and one or more patches end up being
"commuted" into conflictors.  I don't see how your representation
helps to reduce the difficulty of handling "conflicted patches" (ie,
how it leads to a better representation than conflictors).  They're
still going to occur, and the recursive nature of your representation
still suggests combinatorial explosion in some cases.