[darcs-users] so long and thanks for all the darcs

Ben Franksen ben.franksen at online.de
Sat Mar 24 09:52:18 UTC 2018


Hi Steve

sorry for (again) responding at length. You gave me yet more ideas...

On 22.03.2018 at 01:15, Stephen J. Turnbull wrote:
> Ben Franksen writes:
>> My experience with git tells me that when I make a clone what I get
>> is /not/ identical to upstream.
> 
> I may have miswritten. What is identical is the object database.
Okay.

> The refs are supposed to all be copied to refs/remotes/origin,
Hm, that may clarify a few things for me. So a "ref" is a file which
contains a hash that references an object. The content of a ref is
globally valid and thus can be (and is) copied between repos, but the
name of that file (including the directory in which it is located) is a
purely local property. Correct?
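
A toy Python sketch of that reading, purely for illustration: a repo is
modelled as a dict of objects plus a dict of refs, and cloning files the
upstream branch refs under refs/remotes/origin/ as described in the quote
above. The dict layout and the names `object_id` and `clone` are
assumptions made for this example, not git's actual implementation.

```python
import hashlib


def object_id(content: bytes) -> str:
    """Hash of the content: the same in every repo that holds the object."""
    return hashlib.sha1(content).hexdigest()


def clone(upstream: dict) -> dict:
    """Copy the object database; re-file branch refs under a local namespace."""
    refs = {}
    for name, target in upstream["refs"].items():
        if name.startswith("refs/heads/"):
            branch = name[len("refs/heads/"):]
            # Same hash (global), different ref name (local to this repo).
            refs["refs/remotes/origin/" + branch] = target
    return {"objects": dict(upstream["objects"]), "refs": refs}


commit_a, commit_b = b"commit A", b"commit B"
upstream = {
    "objects": {object_id(c): c for c in (commit_a, commit_b)},
    "refs": {
        "refs/heads/master": object_id(commit_a),
        "refs/heads/feature": object_id(commit_b),
    },
}
local = clone(upstream)
```

In this model the object database of the clone equals the upstream one,
while every ref keeps its hash but lives under a different name.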

If yes, then I begin to understand why as a Darcs user I found it so
difficult to become familiar with git. Because this concept of a "ref"
has no (user visible) counterpart in Darcs. It doesn't exist because it
is not needed (for the user). We /could/ add something like it so we can
refer to patches symbolically, but AFAIK nobody has ever found it useful
enough to request it as a feature.

Whereas in git the concept is essential because many of the high-level
features that make git usable as a tool for day-to-day work are built on
it. The core of git is sound, simple, and elegant; but the high-level
features that build on it have been developed in an ad-hoc manner
without an over-arching and similarly elegant abstraction to guide the
design, so it remains necessary to understand the mechanism behind them
in order to use (and appreciate) them appropriately. I think /this/ is
the deeper reason behind git's "bad UI" reputation.

> but only the currently checked out one at remote is linked to a local
> branch, and checked out locally. Configuration ("core" options in 
> .git/config) comes from your local template, I believe.
Okay. I would expect that all local branches are initially linked to
their remote counterpart.

>> And clones of a clone are definitely second class, at least not
>> out of the box.
> 
> If you want an identical clone of a clone, rsync is your friend. But 
> I don't know what you mean by "second class". (I'm curious; I don't 
> doubt that it fails to work as expected.)
Unfortunately I can't remember the details. It may be that I just failed
to understand how to set up the second clone appropriately. Because...
see above.

>>> You've already admitted that it's necessary because of name 
>>> collisions. You just don't like it. :-)
>> 
>> I did not admit anything like that. There are no name collisions
>> if branch names are always relative to a repo URL. No need for 
>> namespacing branches.
> 
> "Relative to a repo URL" *is* a namespace.
Exactly. No need for any naming convention, since a perfectly natural
namespace already exists. Except that git lets you arbitrarily rename
remote branches, circumventing "qualification" with the remote URL, so
they look like local ones. This should not be allowed (IMHO).

>>> People used to do things like name their tracking branches 
>>> <remote>-<branch>, but that had two disadvantages. One, many 
>>> people did what you seem to find natural, and omit the 
>>> "<remote>-" part. After all, it's not my branch, so I won't work 
>>> on it, right?
>> 
>> No, this is not what I find natural. What I find natural is that in
>> my clone the beasts have the same name as in the remote repo from
>> which I cloned, at least by default.
> 
> I don't understand.
You seem to associate branches with an owner ("it's not my branch"). An
interesting aspect I haven't considered yet. So there is a person at the
remote repo who controls (or likes to control) the history of this
branch. He or she is upset if I push a commit to this branch? I should
rather have created my own branch and committed there, so the remote
owner of the branch can integrate my changes with a merge?

> AIUI, that's exactly what I said was natural: many users set up the
> tracking branch as a normal branch with the same name as in the
> source repo.
This would be perfectly okay /if/ this name could never be confused with
a remote branch of the same name, which would require that it is
impossible to give a remote branch a "plain", unqualified name.

>>> Turns out that for various reasons people *do* unintentionally 
>>> commit to those branches.
>> 
>> So what? I don't see how this is a problem.
> 
> You don't see how anyone would commit to a branch they didn't intend 
> to, or you don't see how unintentional commits are a problem?
I don't see how a commit can be problematic even if it was made
unintentionally. You commit explicitly by issuing a command, presumably
after making some changes. This creates a new version (commit object).
If this was indeed unintentional, what's the problem?

However, I see that if you accidentally push these changes, this can be
problematic (if you do not "own" the branch). Because apparently in git
if you push, then the remote branch ref is updated to where it points to
locally. Right?
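
A minimal Python sketch of that push semantics, as a toy model only: the
remote ref is moved to the local hash, provided the old remote commit is an
ancestor of the new one (a "fast-forward"). The `parents` map and the
function names are hypothetical; real git does considerably more than this.

```python
def is_ancestor(parents, old, new):
    """True if commit `old` is reachable from `new` via parent links."""
    stack, seen = [new], set()
    while stack:
        commit = stack.pop()
        if commit == old:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return False


def push(parents, local_refs, remote_refs, branch):
    """Move the remote's branch ref to the local one if it fast-forwards."""
    new = local_refs[branch]
    old = remote_refs.get(branch)
    if old is not None and not is_ancestor(parents, old, new):
        raise ValueError("rejected: not a fast-forward")
    remote_refs[branch] = new


# Commits "b" and "c" are both children of "a".
parents = {"a": [], "b": ["a"], "c": ["a"]}
remote_refs = {"master": "a"}
push(parents, {"master": "b"}, remote_refs, "master")
```

After the call the remote's "master" points at "b"; a later push of "c"
would be rejected in this model because "b" is not among its ancestors.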

(Sigh. Push and pull in Darcs have so much simpler semantics...)

>>> Second, whatever the name, you don't want to commit to those
>>> branches,
>> 
>> Why not?
> 
> Because that ref is the local copy of the remote branch's state, 
> needed for a rollback if there's a problem.
You've lost me there, partly. What is a rollback? In Darcs, rollback
means "apply selected parts of selected changes to the working tree in
reverse" but apparently in git it means something different. Nor does
the rest of the statement make any sense to me: what you committed and
how to get back to where you started could be calculated by comparing
the local with the remote DAG, right? So what's the problem?

>> Exactly what I was saying: they started with a cheap and harmless 
>> looking feature and then had to introduce lots of complexity to 
>> deal with the consequences.
> 
> In practice, the feature is quite the opposite of complex. I don't 
> know about Darcs projects, but if you look around Bazaar and
> Mercurial projects that have processes that emphasize review, they
> often recommend manual maintenance of a tracking branch. Bazaar even 
> provides several different ways of "binding" a tracking branch to 
> upstream to provide various levels of guaranteed synchronization,
> but you still need a separate branch/workspace for the tracking
> branch.
I never claimed you couldn't make things even more complex...

This really starts to remind me of all those programming language debates.

Alice: Java is bad, everything is mutable, IO everywhere, how can you
reason about your code or parallelize it? And it has no type inference,
that's sooo tedious.

Bob: You should look at C/C++, Java is so much better, it has no raw
pointers, everything is an object, it is much more principled. And
anyway, without IO and mutable variables how can your program ever /do/
something interesting? (And who can read Haskell programs anyway, with
all these strange operators in them.)

Carol: C is at least a small, light-weight language, not like these
bloated monsters with a big runtime system and mandatory GC. And you
need raw pointers if you want to program system-level stuff such as
device drivers, try that in Java (or Haskell).

> git's approach provides less automation than Bazaar, but it's almost 
> entirely transparent (you don't need to refer to the tracking
> branch, except to link it to a local branch, or peripherally when you
> need to refer to a remote, for example when you get push rights and
> need to add an ssh URL).
I dislike this terminology of "tracking branch". It suggests some sort
of magical (behind the scenes) coupling of local and remote branches,
whereas you explained to me that there is nothing like that going on. In
my (Darcs influenced) way of thinking there is merely a default
target/source for the push, pull, and send commands (which applies if
none is given explicitly).

Besides, the grammar contradicts the intended meaning (AFAIU): the local
branch is tracking the remote one, so would rather be the "tracking
branch", whereas the corresponding remote branch should, if anything, be
called the "tracked branch". [Side remark: this is similar to calling a
thing an "iterator" even though it iterates nothing, instead it is being
iterated over, so should be named "iteratee". The active mode just
sounds more catchy. I'd rather use a completely abstract term (like
"monad") than one that is catchy but has misleading connotations.]

> I agree that my personal preference would be to set up local
> branches with the same names as all the remote branches, and then I
> could delete the ones that didn't interest me. But I don't consider
> that "complexity", just an annoying default where I'm in the
> minority.
A second requirement for me would be to fully internalize the
namespacing so that remote branches can _only_ be referred to as
remote-repo<separator>branch. But this is not how things work in git as
I understand it (now).

>> I am not against refs but against remotes and how they mangled up 
>> both.
> 
> I don't understand why you think they're mangled. What goes wrong?
I mix them up in my head because they look the same. And I also detest
that I have to register remote repos locally in order to refer to them
in commands, giving them some arbitrary local name, when they already
have a perfectly good universally valid name (the URL). In Darcs I push
and pull between different repos quite often and I would find it
extremely annoying if I had to set up remote repo tracking each time. I
also rely on command line completion for that. (But I have to admit that
this is in part due to different clones representing what in git you
would use local branches for.)

> Specifically, in my own use I clone, set up fetch all, and I'm done.
How does one "set up fetch all"? I can't remember it being mentioned in
the introductions to git. I suppose it means that a fetch not only
fetches objects referenced by the corresponding remote branch but all
objects?

> I almost never need to refer to a remote or a tracking branch.
Suppose you have a local clone of the remote that you share with
colleagues and where you may have changes, only some of which you want
to share with upstream. (Other changes may be site-specific adaptations
or configuration.) You clone from that and work on it. There are already
three repos involved. I don't find this unusual.

> Once or twice a week I use a refs/remotes/origin ref in a diff. Once
> a quarter or less I need to look up the incantations for exposing a 
> non-default remote branch in my local namespace.
Referring to any arbitrary remote should be just

  <remote URL> <separator> <branch>

without having to set up anything. That's my HO, at least.

>>> Yes, I understand that. My point is that you need to do some 
>>> implementation, just as git would have to to emulate patches.
>> 
>> No you do not, as I explained in detail.
> 
> I understand the detail. You can create a DAG in Darcs, without 
> implementing anything more. But there's a lot more to git or 
> Mercurial than just a data structure.

Certainly. It would be interesting (but perhaps a bit tedious) to
analyze in detail which operations on the core data structure are
available or allowed in git vs. Mercurial vs. Darcs-Emulate-Them.

>>> I agree that the implementation is trivial in the data structures
>>> and quite manual in the operations, but I'm not sure what "deep"
>>> support would be, given the requirements that led to their
>>> implementation.
>> 
>> Sure. My comment wasn't meant as a criticism.
> 
> I didn't take it that way. I was asking if you had an idea of what 
> "deeper support" might look like.

I haven't thought about it in depth, yet. The problem is that subrepos
(as I would rather name them) mix two things of quite different nature
(subdir in a tree vs. another repo). My spontaneous reaction to this is
"type error" ;-)

It may be possible to make sense of it in Darcs by adding another kind
of primitive patch for adding and removing subrepos, similar to (but
distinguished from) adding/removing directories. The ideal candidate for
such an integration would be the experimental variant of primitive patch
types where we assign UUIDs to all tree objects (files, dirs) as soon as
they are created (recorded) for the first time, an idea I previously
mentioned in passing. This gives them an identity independent from their
name (path).

The idea is that if two people independently create an object (file or
dir) with the same name, then the "name collision" is most likely
accidental: for both persons the name appears as "not yet used" (in this
project) and thus free for the taking. Which means that version control
should /not/ treat them as the same object, but as different objects.
This means that the conflict between the two additions is merely one of
putting two different things in the same location, which is easily
resolved by renaming one of them; whereas any changes /to/ these objects
never conflict with each other, as they change different things. This
leads to much better commutation properties.

The mechanism behind this is an object map that associates UUIDs with
either a file or directory. And subrepos could be represented as yet
another variant.
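
A rough Python sketch of the object-map idea, under stated assumptions:
patches are plain dicts, each newly added object gets a fresh UUID, and
edits address objects by UUID rather than by path. The patch format and the
names `add_file`, `edit`, and `apply_all` are invented for this example and
are not Darcs code.

```python
import uuid


def add_file(name):
    """Patch that creates a new file object with a fresh identity."""
    return {"op": "add", "id": uuid.uuid4(), "kind": "file", "name": name}


def edit(obj_id, text):
    """Patch that changes an object's content, addressed by identity."""
    return {"op": "edit", "id": obj_id, "text": text}


def apply_all(patches):
    """Apply patches to an empty tree; return (object map, name clashes)."""
    objects = {}
    for p in patches:
        if p["op"] == "add":
            objects[p["id"]] = {"kind": p["kind"], "name": p["name"], "text": ""}
        elif p["op"] == "edit":
            objects[p["id"]]["text"] = p["text"]
    by_name = {}
    for obj_id, obj in objects.items():
        by_name.setdefault(obj["name"], []).append(obj_id)
    # A clash is two distinct objects wanting the same name, nothing more.
    clashes = {name: ids for name, ids in by_name.items() if len(ids) > 1}
    return objects, clashes


# Alice and Bob independently create a file called README and edit it.
alice_add, bob_add = add_file("README"), add_file("README")
alice_edit = edit(alice_add["id"], "Alice's text")
bob_edit = edit(bob_add["id"], "Bob's text")
objects, clashes = apply_all([alice_add, alice_edit, bob_add, bob_edit])
```

Here the only conflict reported is the shared name "README"; the two edits
touch different objects, so in this model they can be applied in either
order with the same result.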

This doesn't sound too bad, as a rough idea, but there are many, many
details left to fill in.

> In general, your language makes it quite clear when you are
> neutrally stating a fact, when you find something painful but aren't
> sure it's avoidable, and when you think something is stupid. So far
> only "remotes" seem to be near the third class. :-)

Ah, good!

>> Yes. Which means such a split really only makes sense if it is 
>> accompanied by a split in the development team. And if changes are 
>> normally contained to one submodule. Both of which weren't the
>> case here, so I guess it's been an abuse of submodules.
> 
> Not knowing the details, all I can say is "sounds plausible". But if 
> it seems like it would help, feel free to quote me. :-)

Thanks... not that it would make much of a difference. It's a difficult
situation because the split was motivated by the integration of a whole
bunch of new functionality that was developed independently (roughly
half of the submodules). I guess the idea was "while we're at it,
let's split off these other parts, too, so we have a more symmetric
picture" or some such. Anyway, they are considering reverting that
decision because making changes that overlap multiple submodules is now
a lot more painful.

>> We do have that, even though it is "only" a cache. I mentioned it
>> in passing, it is called the 'pristine tree' in Darcs. It is a
>> rather important optimization, otherwise we'd have to reconstruct
>> the tree every time which would make Darcs extremely slow. (We
>> currently have only one per repo, but that would have to change
>> when/if we add in-repo branches).
> 
> git has one, per directory in the working tree, *per commit*.

My first thought was that must cost a huge amount of disk space but of
course that's not true since all identical objects are shared in the
database, right? I haven't looked into the internals of Darcs's
representation of the pristine tree, yet, but it's supposed to be
"hashed" so it may already use a git-like storage model.
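
The sharing argument can be sketched with a toy content-addressed store in
Python. This is an illustration only: the class `ObjectStore`, the function
`snapshot`, and the choice of SHA-256 are assumptions for the example and
say nothing about the actual on-disk format of git or of Darcs's pristine
tree.

```python
import hashlib


class ObjectStore:
    """Store blobs under the hash of their content."""

    def __init__(self):
        self.blobs = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.blobs[key] = data  # identical content lands on the same key
        return key


def snapshot(store, files):
    """Record one tree: map each path to the hash of its content."""
    return {path: store.put(data) for path, data in files.items()}


store = ObjectStore()
v1 = snapshot(store, {"a.txt": b"hello", "b.txt": b"world"})
v2 = snapshot(store, {"a.txt": b"hello", "b.txt": b"world!"})
```

Two snapshots of two files each end up as three stored blobs, because the
unchanged a.txt is stored once and referenced from both trees.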

>>> You can have a subtree and replace it with a submodule or
>>> vice-versa. Diffs and things like that will do the right thing.
>> 
>> I think I understand. So if I merge a version that has x/y as a
>> regular blob and another version where x has been added as a
>> submodule, then I get a conflict?
> 
> Yes. And if you want to preserve history it would be annoying and 
> tedious to resolve, unless they're completely independent so that
> the logical thing to do would be to rename one of the directories.
> The annoyance comes from the negotiation as to whether x/y belongs
> in submodule x.

Our future (hopefully) UUID based patch theory could have an advantage
here. I guess we could drive the "deep integration" to the extreme and
even allow users to "rename" tree objects from a repo to a subrepo (or
in the other direction, or between different subrepos; and what about
nested subrepos?). But such features should be well thought through
before we make the mistake of bloating Darcs with features that nobody
needs and that complicate everything!

> The tedium would be using filter-branch to move commits from the
> parent branch to the submodule or vice versa. In a DAG-based VCS, you
> need to rewrite all affected commits and all their descendants.
> 
> In Darcs the annoyance would still be present, but the tedium could
> be handled with sed: just a matter of moving patches from one to the 
> other and then fixing up the paths naming the files involved (the 
> roots change). I imagine fixing up the inventories there would 
> already be a command ("repair", maybe)?

I am more inclined to consider a "deep integration" of subrepos such as
I sketched above, where we track such movements as a kind of primitive
patch and thus don't have to "rewrite history", just commute patches as
usual.

Okay, now you've /got/ me thinking more deeply about this idea. I'm
fascinated by the possibilities :-)

>>> The complexity comes in if you change anything in a submodule. 
>>> [...] There doesn't seem to be a typical case, so at least for 
>>> now it's entirely up to the user to figure it out, and the UI is 
>>> multistep and therefore errorprone (aka "complex").
>> 
>> Yes. Sounds like something to avoid unless there is a compelling
>> reason not to.
> 
> Agreed. "I feel your pain, brother" if you had submodules inflicted 
> on you for insufficient reason, and without sufficient forethought 
> about workflows.

Thankfully I was not in a position where I had to feel the pain myself,
but I have heard others complain about it and I could relate... (and at
some point I probably will encounter these issues; perhaps I'll be lucky
and they decide to undo the split before that happens).

>>> Partial git compatibility and faster checkouts and other 
>>> operations on arbitrary known versions.
>> 
>> Ah, yes. This is one of the properties of Darcs that I have 
>> observed makes people nervous: there is no way to recover a
>> certain state (tree) that your repo had in the past unless you have
>> tagged that state (or made a clone and stored it somewhere).
> 
> Described that way, sounds like it would make me nervous, too. :-)
> On the other hand, in practice I generally have refs for things I
> refer to, except when doing something like bisecting, when recursing
> along parents does the trick. So I'm not sure how important it would
> be in practice for open source. If commercial licensing were
> involved though, legal likes to have everything documented I guess.

Fortunately my experience with proprietary software / development is
very limited, so I can't comment on that last point.

The nervous feeling is something I fully understand and that
occasionally bothers me, too, with Darcs. I have made an effort to
describe a possible solution on our issue tracker and was reminded that
these things have been discussed in the past and there are several
proposals to rectify it. So this is an active area of research for us; I
hope I can come up with a prototype implementation of what I had in
mind, so we have something concrete to experiment with.

Cheers
Ben
-- 
"I tend to avoid fiction about dysfunctional urban middle-class people
written in the present tense." -- Ursula K. Le Guin



