[darcs-users] Request for more details

Wed Nov 11 11:10:05 UTC 2020

Hi Dalon

Am 10.11.20 um 14:51 schrieb Dalon Work:
> I recently discovered Darcs, and probably like lots of people, I'm very
> intrigued by the "Patch Theory" behind it. I've read everything I could
> find, and to be fair, much of it was over my head. I think I understand the
> basic ideas, but I still have a lot of questions. I would appreciate it if
> anyone could point me at some materials I missed or misunderstood, or could
> explain in greater detail how it all works.

Unfortunately there is no ultimate reference for patch theory that
covers every aspect and every variant implemented by darcs.

> So, starting with a set of patches with a linear history: AB, and if you
> want to "undo" A, you have to reorder A & B so you can undo A.

Correct. However, note that there are two ways to undo a patch A: you
can record a new patch that inverts all changes made by A. Or you can
just drop it from the end of the sequence. The first is 'darcs rollback
+ darcs record', the second is 'darcs obliterate'.

> Take AB,
> commute them to get B'A', and then post-multiply by the inverse of A':
> 
> B'A'.inv(A) = B'.

These sequences are not equal, though they have the same effect. In
general we consider two sequences of patches equal if they differ only
by commutation; writing A^ for the inverse of A and [] for the empty
sequence, AA^ and [] are not the same. In /some/ contexts we can treat
them as equivalent, for instance when applying them to a tree, but not
in general.

> I understand the basic diagram that gives the equation 0 -> AB = B'A' -> 1,
> where we start with context 0, and want to get to context 1. But I don't
> understand how we actually solve for B' and A'? It seems we have 2
> unknowns, but only 1 equation. This is the part I can't seem to find.

The underlying theory of (so called) primitive patches must have come
already equipped with such a commute operation. How it is implemented
depends on exactly what the primitive patches are (e.g. their
representation) and on what the designer of the theory wants to achieve.
As an extreme example, a theory in which no two primitive patches can be
commuted is valid.

> Is
> this a manual step? I can see how if the two patches have nothing to do
> with each other, then B' = B, and A' = A. But what if they do influence
> each other, but don't actually conflict?

If you are able to read Haskell, you could look at
https://hub.darcs.net/darcs/darcs-screened/browse/src/Darcs/Patch/Prim/V1/Commute.hs
to find out how it is done in darcs. This is not covered in any text on
patch theory: these texts always assume a primitive patch theory as
given (subject to certain laws).

> Then, in the case of a merge of two parallel patches, how do we know what
> the "unique" merge result is?

Uniqueness of (clean i.e. unconflicted) merge follows from uniqueness of
commute, given the usual patch laws, since the unconflicted merge is
defined as "invert one of the two patches, commute, then invert result
back".

> (Assuming no conflicts). The equation is now
> 0 -> AU = BV -> 1, but now we don't even know what the ending context 1
> looks like! This to me looks like we have 3 unknowns, and only 1 equation.
> What am I missing here?

Again, the key is to assume you already have a function commute that
takes a sequentional pair of patches and gives you back another
sequential pair.

> My next question is about scalability. In a system with hundreds of
> thousands of patches, it seems like a scalability nightmare, as the naive
> description of how the system works makes it seems that to reconstruct the
> state of the repository would require building the working tree from the
> empty context every time.

Correct. Without additional efforts, a "naked" patch theory would be
extremely slow!

> There has to be a way to "snapshot" a set of
> patches so you have a new ground-zero to start from.

And there is (in darcs). We call it the "pristine tree", which is very
much like a "tree object" in git (also stored in a hashed format,
similar to git). It corresponds to the current state of a repo i.e. the
sequence of recorded patches. You could, in principle, store
intermediate snapshots, too, e.g. for each tag, but this is not
currently done in darcs.

Darcs has other "snapshot" like optimisations. For instance, the
sequence of patches in a repo is broken up into so called inventories,
which are sequences of patches terminated by a tag that depends on all
previous patches. This means that most operations only have to work with
the "current inventory" which contains (only) the patches since the
latest tag.

> And then, importing
> patches from an outside repository, is it possible to "batch" them, and
> work on a sets of patches inside of comparing patches 1 at a time? (ignore
> conflicts for now).

Not sure I understand the question correctly. Ultimately you need to
merge patches one by one and also apply them one by one.

> It seems to me
> that you would have to take all the incoming patches, and 1 at a time,
> modify the context for each one of them to apply them on top of the
> importing repository. It sounds painfully slow.

Starting with the current state (the pristine tree) and applying patches
(or unapplying them) to it to get a new state is pretty much optimised
in darcs but dpenending on the number and size of patches can still take
some time. An extreme case is doing 'darcs check' on the current darcs
repo with about 13000 patches; this takes about 15 seconds on my laptop.

But normal operations seldom consider more than a hand full of patches,
perhaps a few hundred in bad cases.

> How do you determine which patches are dependent on which patches? I know
> the user can manually specify dependencies, but there has to be some
> automated way to figure out the important ones, the ones that would break
> things if not maintained correctly.

Simple: B depends on A if A;B does not commute.

> Naively thinking, would you have to try
> and commute the incoming patch with every other stored patch? Or can you do
> that until you find a match, and attach it to the "dependency tree", or are
> there encoded manual rules?

There is no explicit representation of the dependency graph anywhere in
darcs. It wouldn't help, since patches can change representation when
commuted, so even if you new beforehand that A and B will successfully
commute that still doesn't give you the result of the commute.

To understand how darcs merges patches it is necessary to consider patch
"names". The name of a patch is a hash of all its meta-data (name, log
message, timestamp, author, and some random noise added when you
record). This is what identifies a patch semantically. Darcs assumes
that patches with the same "name" really are commuted versions of the
same original patch. This is insecure, but allows to quickly compare
large sets of patches for equality.

When we merge patches from another repo, we first find the set of all
patches common to both the local and the remote repo. What we actually
have to merge are the remaining sequences of patches i.e. those that are
only in our repo and those that are only on the remote one.

Finding the common set of patches is further optimised using inventories
but I guess explaining that would lead us too far away.

> What is the internal patch format for darcs? Like, could I implement a
> crude patch-based system using unix diff/patch utils?

No, that is not possible. The basic patch operations MUST strictly
adhere to the laws of patch theory.

> Or do you have to
> store extra metadata to get the mechanics to work correctly? So far, all
> I've found is the high-level "a patch is a transformation function". Great.
> What does that function actually look like in real-life?

As I mention above, this is not covered by the existing texts on patch
theory. The point is that this function must be reified as data!

I will give you an example. In Darcs, the most important primitive patch
type is called "hunk". A hunk consists of the following data:

 * file name
 * position (line number)
 * lines to remove
 * lines to add

In short, we have

  Hunk filename pos [ContentLine] [ContentLine]

where ContentLine is how you represent a line of text.

To commute two Hunks very roughly goes as follows:

If the filenames differ, they commute trivially (i.e. without change).
If what the first patch adds overlaps with what the second one removes,
then they fail to commute. Otherwise, we can commute them by adapting
the line numbers. The details are a bit tricky (fiddling with line
numbers, what exactly means "overlap") but not hard to reproduce.

For details, see
https://hub.darcs.net/darcs/darcs-screened/browse/src/Darcs/Patch/Prim/V1/Commute.hs#214

The definition of the primitive patch data type is
here:https://hub.darcs.net/darcs/darcs-screened/browse/src/Darcs/Patch/Prim/V1/Core.hs#40

It is not difficult to define your own primitive patch theory, but
making sure it conforms to all the required patch laws is another
matter! We never formally proved that for darcs, but instead rely on
random property testing using QuickCheck.

Cheers
Ben