[darcs-users] Re: an interface for splitting hunks

John S. Yates, Jr. jyates at netezza.com
Fri Apr 1 18:16:12 UTC 2005


"Benedikt Schmidt" <ry102 at rz.uni-karlsruhe.de> wrote:

> >> Here is how the diff code i'm working on for darcs (similar to the
> >> one in GNU diff) creates the diff:
> > ...

and then again:

> standalone diff tool with the same diff code (for testing):
> http://www.stud.uni-karlsruhe.de/~ry102/darcs/hdiff

I was curious what actual algorithm you might be using.  I don't read
Haskell (yet) but I found this comment in Lcs.lhs intelligible:

  In the first step, a hash value for every line is calculated and
  collisions are marked with a special value. This reduces a string
  comparison to an int comparison for line tuples where at least one of
  the hash values is not equal to the special value. After that, lines
  which only exist in one of the files are removed and marked as
  changed, which reduces the running time of the following difference
  algorithm. GNU diff additionally removes lines that appear very often
  in the other file in some cases. The last step tries to create longer
  changed regions and line up deletions in the first file with
  insertions in the second by shifting changed lines forward and
  backward.
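
Out of curiosity I sketched my reading of that first step in rough
Python (I don't read Haskell well enough to check it against Lcs.lhs).
It is only an illustration of the idea, not the darcs code: Python's
built-in hash() stands in for whatever hash Lcs.lhs actually uses, and
the collision sentinel is my own invention.

  COLLISION = object()   # sentinel: this line's hash is ambiguous

  def key_lines(lines_a, lines_b):
      """Give every line an int key; lines whose hash collides with a
      *different* line get the COLLISION sentinel instead."""
      by_hash = {}
      for line in lines_a + lines_b:
          h = hash(line)
          if by_hash.get(h, line) != line:
              by_hash[h] = COLLISION   # two different lines, same hash
          else:
              by_hash[h] = line

      def key(line):
          h = hash(line)
          return h if by_hash[h] is not COLLISION else COLLISION

      return ([(key(l), l) for l in lines_a],
              [(key(l), l) for l in lines_b])

  def same_line(a, b):
      """The comparison the comment describes: an int comparison unless
      either key is the sentinel, in which case compare the strings."""
      (ha, la), (hb, lb) = a, b
      if ha is COLLISION or hb is COLLISION:
          return la == lb
      return ha == hb

With the keys in hand the later stages can compare (key, line) pairs
with same_line() and only pay for a string comparison on the rare
collision.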

Years ago I collaborated with David Leblang on DSEE, the progenitor of
ClearCase.  Both systems were noted for the quality of their diffing.
While I did not work on the diff algorithm (I believe that David wrote
it himself), I did receive a verbal description.  I would swear that at
the time he told me that it was an implementation of a published idea.
Yet since then I have never been able to find anything that resonated.
A particular virtue of the algorithm was that it naturally identified
moves, rather than only adds and deletes.

(The algorithm used hashing to accelerate string comparison, but I will
ignore that in the following description.  Also, as I am writing up a
conversation from 15 or 20 years ago about code I never actually had a
chance to study, please excuse any lack of rigor or missing details. :-)


The first step is to identify the set of distinct lines in each file.
In a given file a line may occur once or repeatedly.  Call this count
the number of (as yet) unmatched instances of that line within a file.
Thus each distinct line has two such counts: the count of unmatched
instances in file A and the count of unmatched instances in file B.
Initially one of the two counts may be zero, but it is meaningless for
both to be zero.
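
A minimal sketch of that bookkeeping in Python, under my guesses about
the data structure (this is certainly not DSEE's code):

  from collections import Counter

  def initial_counts(lines_a, lines_b):
      """For every distinct line, the number of (as yet) unmatched
      instances in file A and in file B."""
      in_a, in_b = Counter(lines_a), Counter(lines_b)
      return {line: [in_a[line], in_b[line]]
              for line in set(lines_a) | set(lines_b)}

  def unique_in_both(counts):
      """Lines with exactly one unmatched instance in each file."""
      return [line for line, c in counts.items() if c == [1, 1]]

The lines reported by unique_in_both() are exactly the "nucleation"
points of the next paragraph.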

The crucial insight is that in real world text, while there may well be
numerous repeated lines, there are unique lines as well.  The pivotal
heuristic is to make unique lines present in both files "nucleation"
points from which to grow matching regions.

The heart of the algorithm is a worklist scheme; a rough sketch in
Python follows the two steps below.  The worklist always contains the
set of lines for which both counts of unmatched instances are one (1).

Growing a maximal matching region:

1) Remove a line from the worklist.  Since by definition this line
   exists at a unique (unmatched) location in each file, align those
   two locations.  This line now forms the basis of a new matching
   region.

2) Now work forward and back from that location in each file, comparing
   successive lines.  If the next line in the same direction is
   identical in both files, it is added to the current matching region.
   This means that a previously unmatched instance of that line in each
   file has now been matched.  Decrement its two unmatched instance
   counts.  If the two counts are now both 1, add this line to the
   worklist (since there remains exactly one unmatched instance of it
   in each file).  If the two counts are now both 0, that line must
   previously have been on the worklist, so just remove it.
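
To check my own memory, here is a rough Python sketch of those two
steps.  It is emphatically not David's code, and it takes one liberty:
rather than removing dead entries from the worklist (the both-zero
case in step 2), it simply skips entries that have gone stale by the
time they are popped.

  from collections import Counter

  def match_lines(a, b):
      """Return the matched (index in a, index in b) pairs, grown from
      the unique "nucleation" lines as described above."""
      in_a, in_b = Counter(a), Counter(b)
      counts = {line: [in_a[line], in_b[line]]   # same bookkeeping as
                for line in set(a) | set(b)}     # the previous sketch
      matched_a, matched_b = set(), set()        # already-matched positions
      pairs = []
      worklist = [l for l, c in counts.items() if c == [1, 1]]

      def record(i, j):
          """Match a[i] with b[j], decrement that line's two counts, and
          put the line back on the worklist if they fell to (1, 1)."""
          matched_a.add(i)
          matched_b.add(j)
          pairs.append((i, j))
          c = counts[a[i]]
          c[0] -= 1
          c[1] -= 1
          if c == [1, 1]:
              worklist.append(a[i])

      while worklist:
          line = worklist.pop()
          if counts[line] != [1, 1]:
              continue               # stale entry; covers step 2's "remove"
          # Step 1: align the single unmatched location in each file.
          i = next(k for k, x in enumerate(a) if x == line and k not in matched_a)
          j = next(k for k, x in enumerate(b) if x == line and k not in matched_b)
          record(i, j)
          # Step 2: grow forward and backward while the next lines agree.
          for step in (1, -1):
              di, dj = i + step, j + step
              while (0 <= di < len(a) and 0 <= dj < len(b)
                     and di not in matched_a and dj not in matched_b
                     and a[di] == b[dj]):
                  record(di, dj)
                  di, dj = di + step, dj + step
      return sorted(pairs)

Runs of pairs whose indices advance in lock step are the matched
regions; whatever is left over in either file forms the unmatched
regions discussed next.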

When this algorithm terminates, each file is represented by an ordered
list of matched and unmatched regions.  Where equivalent matched
regions occur in the same order in both files, a discontinuity between
them must be either an addition:

   File A   File B
   ======   ======
     M1 ----> M1
     M2 -\    U+
          \-> M2

Or a deletion:

   File A   File B
   ======   ======
     M1 ----> M1
     U-   /-> M2
     M2 -/

More interestingly, an out-of-order matched region corresponds to a
move:

   File A   File B
   ======   ======
     Mm -\
     M1 --\--> M1
     M2 -\ \-> Mm
          \--> M2
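
To make that concrete, suppose the matched regions are listed in file A
order along with where each one landed in file B; a move then shows up
as a start position that is out of order.  The toy Python test below is
my own simplification, not part of David's description.

  def classify(regions):
      """regions: list of (start in A, start in B, length) tuples, in
      file A order.  Flag a region as a move if it starts earlier in B
      than some region already passed."""
      highest_b = -1
      out = []
      for a_start, b_start, length in regions:
          if b_start < highest_b:
              out.append(("moved", (a_start, b_start, length)))
          else:
              highest_b = b_start
              out.append(("in place", (a_start, b_start, length)))
      return out

Note that for the Mm picture above this naive rule blames M1 rather
than Mm; deciding which side of an inversion to call "the move" (say,
by keeping the longest in-order run in place) is a policy question the
real algorithm would have had to settle.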

David also had a refinement in which he treated leading whitespace
separately from the balance of a line.  That version of the algorithm
was able to recognize instances of altered indentation.  I will leave
working out that variant as an exercise for the interested reader. :-)

/john
