[darcs-users] so long and thanks for all the darcs

Ben Franksen ben.franksen at online.de
Thu Mar 29 19:46:03 UTC 2018


On 29.03.2018 10:08, Stephen J. Turnbull wrote:
> Ben Franksen writes:
>  > If yes, then I begin to understand why as a Darcs user I found it so
>  > difficult to become familiar with git. Because this concept of a "ref"
>  > has no (user visible) counterpart in Darcs. It doesn't exist because it
>  > is not needed (for the user). We /could/ add something like it so we can
>  > refer to patches symbolically, but AFAIK nobody has ever found it useful
>  > enough to request it as a feature.
> 
> That sounds almost right to me.  The exception is a tag, which is
> present in Darcs and induces a version via its dependencies, whereas
> in a DAG-based VCS it is a ref, and points to a version in the history
> graph.

Internally we do use references, similar to git (we refer to patches,
inventories, and trees via content hash). But in contrast to git, these
are not exposed as a user visible concept. Tags are somewhat special;
they do serve to identify versions, i.e. what git uses refs for. But
since their behavior is specified in terms of patch dependencies, they
are not really an exception to the rule.

>  > Whereas in git the concept is essential because many of the high-level
>  > features that make git usable as a tool for day-to-day work are built on
>  > it.
> 
> In fact, any DAG-based VCS requires refs.  Mercurial and Bazaar add a
> sequence number ref type (and in Bazaar it's actually a structured
> sequence number which identifies the branch and version relative to
> the branch point, recursively \o/).

Okay. I see how git may look simple when you compare it with Bazaar...

>  > The core of git is sound, simple, and elegant; but the high-level
>  > features that build on it have been developed in an ad-hoc manner
>  > without an over-arching and similarly elegant abstraction to guide the
>  > design, so it remains necessary to understand the mechanism behind them
>  > in order to use (and appreciate) them appropriately. I think /this/ is
>  > the deeper reason behind git's "bad UI" reputation.
> 
> From my point of view, all you've said is "people don't grok DAGs". :-)

Ah well, I do keep telling people that the memory in their computers is
nothing but an array of bytes, but they still find it complicated, so
perhaps they just don't grok arrays?

This reminds me of the old Haskell joke "A monad is just a monoid in the
category of endofunctors, what's the problem?". What I personally find
funny about this is that it's actually true, and I don't mean that in
the mathematical sense. The concepts mentioned in this short sentence
are a lot harder to understand than "DAG" or "array", but once you do
understand them, actually using them in a way that reliably gives you
the results you expect is quite simple. Whereas understanding the array
concept is easy, but using arrays is fraught with hazards (especially if
you assume everyone has access to every element and can modify it at
will...).

>  > > but only the currently checked out one at remote is linked to a local
>  > > branch, and checked out locally. Configuration ("core" options in 
>  > > .git/config) comes from your local template, I believe.
> 
>  > Okay. I would expect that all local branches are initially linked to
>  > their remote counterpart.
> 
> This would be really ba-a-ad if you were working on the kernel where
> everybody is cloning everybody else's repos all the time.  The git
> developers all are kernel developers. ;-)

I don't get why this would be bad, except...

> I would think "link 'em all" is a better default for most projects,
> except that in git branch refs are really lightweight, so developers
> are likely to have a bunch of obsolete or experimental branches lying
> around that you don't want.

Good point. I was thinking about "official" branches only, not
experimental/feature/whatever branches that anyone can and does create
all the time.

(But then you also don't want the commits on these branches, right?)

Which is the better default then depends on the preferred work-flow in
your project.

>  > > "Relative to a repo URL" *is* a namespace.
> 
>  > Exactly. No need for any naming convention, since a perfectly natural
>  > namespace already exists. Except that git allows to arbitrarily rename
>  > remote branches, circumventing "qualification" with the remote URL, so
>  > they look like local ones. This should not be allowed (IMHO).
> 
> This is how Subversion works (and CVS before it and Bazaar
> "lightweight checkouts" after it).  With that restriction, distributed
> development is painful.  Avoiding that restriction is why Arch,
> BitKeeper, git, Mercurial, Monotone, Bazaar, ... were developed.
> Darcs, too. :-)

I don't understand. What has distributed versus centralized to do with
it? I'd say in a centralized system there is only one "remote", so the
question is moot. Is that what you mean?

> Most people prefer working within that restriction to dealing with
> concurrency, I admit, but highly concurrent development is really
> painful with it.  These systems all force you to rebase all your work
> on top of the official version before you can commit even once,
> because when concurrent development is taking place, there are
> multiple branches with the same name in this model: one local, one
> official, and possibly others local to other developers (or even
> yourself!)  Determining whether you are synched to official requires a
> remote query with an irremediable race condition, and it's impossible
> to know if you're synched with third party branches with that name.

This is all uncontroversial IMO and has nothing to do with the question
we are discussing.

> Darcs avoids all this by modeling a branch as a history-less set of
> patches.  Of course the semantics of text require certain implicit
> dependencies (you can't delete a line that doesn't exist in the text).

(there are systems that do allow that, but not Darcs)

> Also of course you want semantic dependencies (don't add a patch
> calling foo() in module bar if you don't have the patch that adds
> module foo, for example).  History-based VCSes satisfy the text
> requirement automatically, and mostly human programs do satisfy the
> semantic requirement too, but of course they also drag in a pile of
> spurious dependencies.  Darcs avoids the spurious dependencies at the
> cost of requiring explicit specification of semantic dependencies --
> but again the natural human tendency to do first things first means
> that most of the time you don't need specify them: you won't try to
> commute the call to foo() backward past the add of module foo.

I am not sure I want "semantic" dependencies. The best a general purpose
text based tool can give you is a crude approximation of it.

The version (DAG) based systems approximate on the "safe" side: any
change, even the smallest, semantically irrelevant one, introduces a
dependency.

Darcs chooses to err on the "flexible" side: by default only the minimum
(technically necessary) dependencies are introduced.

I personally prefer the flexible approach and am prepared to deal with
the occasional "missing" dependency. In fact I often want even more
flexibility than Darcs offers, which is why I am favoring a new
foundation for Darcs that allows more patches to commute.

> (2) the same branch has multiple URIs (in git there are git:, ssh:,
>     http:, and https: URIs at least, and they frequently have
>     different paths), which is why URI naming isn't good enough.

An interesting point I hadn't considered yet. But can git give the same
name to different URLs? I think it cannot, else how would it know what
it should do when I say 'git pull <remote>' (i.e. should it use ssh or
http?). So how do "remotes" help to manage the different URLs for the
"same" remote repo?

> You don't have to like it, but there are strong reasons for doing it
> this way if you want your development organization to scale to many
> developers working independently on anything they want to.

I must say that I lack the experience of working at a scale of something
like the Linux kernel. But I do value the possibility to push and pull
patches from any repo to any other, as darcs allows me to, and I am
using that feature in practice. I am pretty sure the Darcs model would scale
to a large number of developers but I have no proof.

> If there are multiple people with push
> permission, your *VCS* will need a conceptual way of referring to
> content that is intended to end up in the "official" branch that
> diverges from other content also destined for that branch (or already
> incorporated in that branch).

Of course. But then, assuming I do not want to push changes to "master"
because this is how the project is organized, I just don't do it, right?

>  > I should rather have created my own branch and committed there, so
>  > the remote owner of the branch can integrate my changes with a
>  > merge?
> 
> I'm not saying "you should", I'm saying "you do".  In a DVCS, by
> committing locally you *do* create your own branch.  Its content is
> *not* identical to the remote branch.  This is just as true in Darcs:
> your repo contains a patch not in the upstream repo.  You don't know
> that your patch is *the* extension of the branch because of the race
> condition.  You may need to rebase or merge (in Darcs, amend the
> patch) before the push.  In both systems, evidently we intend a merge
> and push, but at the moment of the commit, the fact is that the repos
> (including those of third parties we may or may not know about) *are*
> divergent.

Yes, exactly. I took all this for granted, which is why I asked "so what"?

>  > > You don't see how anyone would commit to a branch they didn't
>  > > intend to, or you don't see how unintentional commits are a
>  > > problem?
> 
>  > I don't see how a commit can be problematic even if it was made
>  > unintentional. You commit explicitly by issuing a command,
>  > presumable after making some changes. This creates a new version
>  > (commit object).  If this was indeed unintentional, what's the
>  > problem?
>  >
>  > However, I see that if you accidentally push these changes, this can be
>  > problematic
> 
> This is what I meant, but did not write.  :-(

Perhaps as a Darcs user (remember: we have no "branches", just repos) I
do not think about this as something problematic because it is too
obvious. /Of course/ when I clone, then record a patch, then push that,
then this adds my patch to the original repo. What else should happen?
If I don't want that, well I just don't push. If I want to publish my
changes without pushing to the original repo, I must clone (upload) to a
different location.

With branches, things may be different. It may make sense to have push
behave in a "safe" way by default, that is, create a different remote
branch. I am not sure I would want that but it is worth considering.

>  > Because apparently in git if you push, then the remote branch ref
>  > is updated to where it points to locally. Right?
> 
> Yes: "push" implies "to a specific branch".  I don't understand the
> "apparently"; what else would you expect as a Darcs user?

I have become wary of transplanting /any/ kind of expectations from
Darcs to git.

>  When you
> push, the remote repo gets updated so that someone who clones or pulls
> right away gets the same repo you have, no?  Isn't this communication
> of new content to a specific line of development in a controlled
> fashion the whole point of push?
> 
>  > (Sigh. Push and pull in Darcs have so much simpler semantics...)
> 
> Only because you don't have multiple branches in one repo, so URL of
> repo == name of branch == only ref that ever matters to you, and it's
> mostly trivial to keep track of "here vs there".

Yes, there is some truth to that. I would very much like to retain this
conceptual simplicity even when/if we add branches to Darcs. I think
that if we use the current model as a guide, then we can achieve that:

A URL+branch in Darcs-with-branches behaves like a URL does now. A
branch alone is short for "the local repo"+branch. A URL alone means
URL+"default branch", where "default branch" is the name of the branch
you are on, unless configured otherwise. Everything else remains as it is.
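
To illustrate, with a completely made-up syntax (the separator and the
exact commands are not decided, this is only meant to show the idea):

  darcs pull http://example.org/repo#feature   # URL+branch, behaves like a URL today
  darcs pull feature                           # branch alone: that branch in this repo
  darcs pull http://example.org/repo           # URL alone: its default branch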

>  > >>> Second, whatever the name, you don't want to commit to those
>  > >>> branches,
>  > >> 
>  > >> Why not?
>  > > 
>  > > Because that ref is the local copy of the remote branch's state, 
>  > > needed for a rollback if there's a problem.
>  >
>  > You've lost me there, partly. What is a rollback? In Darcs, rollback
>  > means "apply selected parts of selected changes to the working tree in
>  > reverse" but apparently in git it means something different.
> 
> I mean it in the database sense: rollback the pending transaction
> ("pending" meaning "not yet pushed").  In Darcs, this would be
> "obliterate", I believe.

Another point where it is problematic to transfer concepts naively. Yes,
in a way this is what obliterate would do, more specifically 'darcs
obliterate --not-in-remote'. This is not something I would associate
with "rollback" in the transactional sense, even though i admit that
technically it is. (My view of transactions is that they are short-lived
deviations from the one-state-for-all norm.)

>  > Neither makes the rest of the statement any sense to me: what you
>  > committed and how to get back to where you started could be
>  > calculated by comparing the local with the remote DAG, right? So
>  > what's the problem?
> 
> You don't know when the ref has moved in the remote DAG (git doesn't
> record timestamps for push, and both author and committer commit
> timestamps can be forged at commit time, which is different from push
> time), so that's not useful.

Hm. When I transfer this back to Darcs, the remote repo has accumulated
more patches. As long as I do not pull them, my reference point is
unchanged, so 'darcs obliterate --not-in-remote' really brings me back
to where I started.

> You do know it has moved in the local DAG, and you know that it was an
> ancestor commit, but *git* doesn't know by how many commits.  You can
> perform such a rollback by hand, but git cannot implement a rollback
> command.  If you cannot commit to the tracking branch (as in current
> git) you can implement a rollback command.  Git doesn't call it
> "rollback", it's spelled "git reset origin/master".

I see. In my view of things this is yet another point where the Darcs
model gives you an advantage without any additional effort.

> Concurrency itself is complex.  My claim is that git's basic set of
> operations:
> 
> - init
> - clone        # exceptional: takes URL argument
> - fetch
> - push
> - commit
> - branch
> - checkout
> - reset
> - merge
> 
> with no options (except for reset --hard) and only ref arguments
> (except for clone) is about as simple as you can get for maintaining a
> history DAG under the requirements that development is concurrent, it
> is not coordinated, and you want full control of your branch refs (in
> modern git these are all local).

This may very well be true.

But then why do all these commands have literally hundreds of options,
and why are the man pages "explaining" them stuffed with technical
details that no one except a handful of experts understands?

Try to google "how to xyz with git" where xyz is something simple
occurring in your daily work. You'll find all sort of answer on
stack-overflow, none of which is a single command with no options.

> The problem is that git, because it's so simple and unopinionated,
> allows you to play all kinds of tricks with the DAG that in other
> systems would require applying and unapplying patches, and might
> impose ugliness on history.  

I have no problem with a tool that is powerful and allows me to play all
kinds of tricks, as long as these tricks don't violate the internal
invariants that hold everything together.

> Folks who favor Bazaar and Mercurial often argue to the contrary that,
> as a matter of pragmatic software engineering practice, frequent use
> of rebase results in a mainline that is only tested intermittently, or
> perhaps not at all, despite 100% testing of pre-rebased commits.  I
> understand the point but I think this is mostly an organizational
> workflow problem, and to some extent an issue of testing resources.
> Note: this issue was first recognized by Linus himself when he flamed
> David Miller for excessive rebasing.

Yes. And this is exactly why we have a patch theory in Darcs! I
mentioned it before: the patch algebra laws are exactly what is needed
to ensure that re-ordering of patches gives consistent results under all
circumstances. In git, when you rebase, it is the user's responsibility
to ensure consistency, and humans are notoriously bad at things like that.

>  > I dislike this terminology of "tracking branch".

[I snipped your explanation, the main point being:]

>> Besides, the grammar contradicts the intended meaning (AFAIU): the 
>> local branch is tracking the remote one, so would rather be the 
>> "tracking branch", whereas the corresponding remote branch should, 
>> if anything, be called the "tracked branch".
> 
> That's exactly right. Remember there are *three* branches here: the 
> remote branch (which is not called "tracked" because the tracking is 
> automatic and implicit), the tracking branch ref (which gets out of 
> synch when the remote branch gets updated by another developer, and 
> so needs to be updated by fetch), and the local "working" branch.

Ugh. Three branches. I think I begin to understand. Yes, in this case
the terminology makes sense.

>  > whereas you explained to me that there is nothing like that going
>  > on. In my (Darcs influenced) way of thinking there is merely a
>  > default target/source for the push, pull, and send commands (which
>  > applies if none is given explicitly).
> 
> Sure, and once again this is not going to work as written when you
> have multiple branches in one repository, which are going to have
> different target/sources.

I am still not convinced of this, if we model Darcs branches in such a
way that they are equivalent to what a Darcs repo is now.

But we need not discuss this any further. Either we manage to come up
with a better model in a future Darcs, proving my point, or we don't, in
which case the question is moot.

>  > I'd rather use a completely abstract term (like "monad") than one
>  > that is catchy but has misleading connotations.]
> 
> The categorists I know think that Haskell's use of "monad" is an
> abomination, because Haskell doesn't enforce the monad laws; the
> ensuing bugs when a programmer fails to enforce them force the
> programmer to go back and DTRT. ;-)

Oh, come on, this is ridiculous. They should refrain from throwing
stones, sitting in that glasshouse of theirs. Yes, they demand proof
when someone claims "this is a field/monad/ring/whatever", but so do we.
It's just that in Haskell there is (currently) no way to express that
proof /inside the language/ so it can be automatically checked by the
compiler. How many of the theorems these guys use day in day out have
been formally verified, if I may ask? Right, not many. So this is
applying double standards and they know it.

That said, I do agree that the way Haskell uses the term monad can be
misleading, from a categorical POV, but for a different reason: The
mathematical monad is parametric in the category, whereas the Monad type
class is not: its category is "hard-coded" as that of Haskell types and
functions between them. This is why we carefully distinguish between
"monad" (the concept form category theory) and "Monad" (the type class).
And similarly for functor vs. Functor etc.
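
To spell out what I mean by "hard-coded", here is the class, slightly
simplified, with the laws (which GHC neither states nor checks) written
as comments:

  -- The category never shows up as a parameter: the arrows are plain
  -- Haskell functions (a -> m b), so the category of Haskell types
  -- and functions is baked in.
  class Applicative m => Monad m where
    return :: a -> m a
    (>>=)  :: m a -> (a -> m b) -> m b

  -- The monad laws, stated only in the documentation:
  --   return a >>= k   =  k a
  --   m >>= return     =  m
  --   (m >>= k) >>= h  =  m >>= (\x -> k x >>= h)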

>  > A second requirement for me would be to fully internalize the
>  > namespacing so that remote branches can _only_ be referred to as
>  > remote-repo<separator>branch. But this is not how things work in
>  > git as I understand (now).
> 
> You're right, it doesn't work that way, and because of the multiple
> URLs referring to a single repo issue, it never can.  You need a
> convention so that git can always do the right thing once the
> configuration is what you want.

We will see if we can come up with something better. I like it if my
tool does the obvious thing right out of the box without any need for me
to configure it.

>  > I mix them up in my head because they look the same. And I also
>  > detest that I have to register remote repos locally in order to
>  > refer to them in commands, giving them some arbitrary local name,
>  > when they already have a perfectly good universally valid name (the
>  > URL).
> 
> s/the/a typically non-unique/ ;-)

I may be wrong but I don't think this matters. Names are never unique.
They merely need to be unambiguous. File names aren't unique either (due
to cwd, and links and so on), still everyone uses them to refer to files.

Keeping with the analogy, what git does with remotes is similar to the
current working directory with some symbolic links in it. Wonderfully
convenient, you can refer to far away files easily without spelling out
the long path. But very dangerous in case you happen to assume you are
in some directory but are in fact in a different one. And extremely
confusing if you link different paths to the same name in different
working dirs.

> But as far as I know, with the exception of diff, all commands where
> you want to refer to a remote allow you to use any of the URLs that
> refer to it.

Yes, sorry. I was remembering this wrongly.

>  > In Darcs I push and pull between different repos quite often and I
>  > would find it extremely annoying if I had to set up remote repo
>  > tracking each time. I also rely on command line completion for
>  > that. (But I have to admit that this is in part due to different
>  > clones representing what in git you would use local branches for.)
> 
> Yup.  Use branches: annoyance evaporates! ;-)

I reserve the right to remain doubtful as to that... ;-)

>  > > Specifically, in my own use I clone, set up fetch all, and I'm done.
> 
>  > How does one "set up fetch all"?
> 
> The simplest way I know is
> 
> cd git-repo
> for ref in `git branch -r | grep -v 'HEAD\|master'`; do
>     git branch --track `basename $ref` $ref
> done

If this is the simplest way, I prefer complicated ;-)

Okay, so git branch -r lists... what exactly? Ah, okay, "Option -r
causes the remote-tracking branches to be listed". From the result,
filter out HEAD and master, okay. Then do 'git branch --track'... which
means... here we go:

"""
When a local branch is started off a remote-tracking branch, Git sets
up the branch (specifically the branch.<name>.remote and
branch.<name>.merge configuration entries) so that git pull will
appropriately merge from the remote-tracking branch. This behavior may
be changed via the global branch.autoSetupMerge configuration flag. That
setting can be overridden by using the --track and --no-track options,
and changed later using git branch --set-upstream-to.
"""

I am getting headaches from this. I think it means (but I am far from
sure) that to get the behavior I want, I should check out a remote
tracking branch and then start a local branch from that?
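
If I read it right (and I may well not), the incantation for a single
branch would be something like

  git checkout --track origin/somebranch
  # or, without switching to it:
  git branch --track somebranch origin/somebranch

but I won't swear to it.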

>  > I can't remember it being mentioned in the introductions to git.
> 
> It's not.

I am not surprised.

>  git users usually have a bunch of obsolete or experimental
> branches lying about, that you would not be interested in tracking.

Granted. So how does git know which branches you are interested in and
which not? Simple: you are (supposed to be) interested in whatever the
remote has named "master". No?

Once again, I prefer to (have to) tell it explicitly which branch I want.

> So I do it by hand because "all" really means "all interesting", and
> there's usually only one or two of those.

Understood. For a public repo that contains "the" official
version(s)/branches this may be different. For a personal repo where a
developer uploads all the stuff he/she is working on, you could clone
the one branch you are interested in. (I don't know if you can clone a
branch in git.) BTW, don't people clean up their repos every now and
then, i.e. throw away obsolete branches?

>  > I suppose it means that a fetch not only fetches objects referenced
>  > by the corresponding remote branch but all objects?
> 
> fetch's basic syntax is "git fetch [options] [repository [ref ..]]".
> 
> There's a default repository (usually "origin").  git fetches all
> objects referred to by all configured branches for that repository
> (usually all of them), unless refs are specified, then it limits to
> those remote branches.

Ah, thanks.

>  > > I almost never need to refer to a remote or a tracking branch.
> 
>  > Suppose you have a local clone of the remote that you share with
>  > colleagues and where you may have changes, only some of which you want
>  > to share with upstream. (Other changes may be site specific adaptions or
>  > configuration). You clone from that and work on it. There are already
>  > three repos involved. I don't find this unusual.
> 
> Well, for me there would be two repositories (local and remote) and
> four branches: the remote master, an explicit mirror of master, my
> feature branch for publication, and my local branch for local-only
> configuration.

What about the sharing with colleagues? (Of configuration changes or new
features or fixes that aren't ready for upstream.) As I understand, in
your work-flow these are all either local branches in a repo in your
home dir, perhaps on your own computer. Or else, you push them for all
the world to see in a branch to the upstream repo. Both of which aren't
ideal IMO. You really want a third repo in between upstream and local
for that. In git this must be a bare repo, so you cannot and aren't
supposed to work in it, right?

>  > > Once or twice a week I use a refs/remotes/origin ref in a diff. Once
>  > > a quarter or less I need to look up the incantations for exposing a 
>  > > non-default remote branch in my local namespace.
> 
>  > Referring to any arbitrary remote should be just
>  > 
>  >   <remote URL> <separator> <branch>
>  > 
>  > without having to set up anything. That's my HO, at least.
> 
> Well, "origin" *is* an URI, relative to the local repository, if
> you're in one. 

This is a contradiction in terms. The 'U' in URI stands for 'universal'.

> As for set up, origin is just an alias.  If you want
> to fetch or pull with a full URL, you can do that.  Nobody I know
> does, though.

I did that a few days ago, because setting up the remotes correctly is
just too much hassle for me.

> Diffing you cannot do that way, because all diffing is done locally.
> This is true in all VCSes: you have to copy (download) the content to
> do the diff.  The difference with Bazaar and Mercurial, and I guess
> Darcs, is that git makes this explicit, and requires a local ref for
> diff.

Diffing is a completely separate story.

>  > It may be possible to make sense of it in Darcs by adding another
>  > kind of primitive patch for adding and removing subrepos, similar
>  > to (but distinguished from) adding/removing directories.
> 
> But this is quite analogous to what git does.  

Yes. I like to learn from git (in both ways).

> If a command doesn't
> require history metadata for its argument, then you can always use a
> tree or a commit that refers to that tree indifferently.  If a
> commit's *content* object is a commit, then git recognizes it as a
> subrepo, and stops (for most history-using commands), recurses (for
> the submodule command), or dereferences (for content-using commands).
> 
> It just happened that using a commit instead of designing a new object
> suggests pretty much exactly the semantics of "recursive DAG", which
> is an immediately plausible way to think about subrepos in git.

Whenever you call something "immediately plausible" in git, it feels to
me like we live on different planets. For instance, here you refer to "a
commit's *content* object" and I have only a vague idea what that is.
Neither do I understand "recursive DAG" or why you put it in quotes.
Your first paragraph could just as well be in Chinese.

You said earlier that git represents a submodule as a tree object that
is itself a commit. But it cannot be the commit that represents the
current (pristine) tree in the submodule, else I could not make a commit
in the submodule (or pull there) without making a commit in the
containing repo/branch. So the best it can be is the nominal version of
the submodule, as specified in the .gitmodules file, right?

>  > The ideal candidate for such an integration would be the
>  > experimental variant of primitive patch types where we assign UUIDs
>  > to all tree objects (files, dirs) as soon as they are created
>  > (recorded) for the first time, an idea I previously mentioned in
>  > passing. This gives them an identity independent from their name
>  > (path).
> 
> This is the role played by "blob" objects (for files) and "tree"
> objects (for directories) in git.  The UUID is just the SHA1 of the
> object's content.  You may prefer a true UUID such as <SHA of
> content>-<committer email>-<timestamp>,

It has nothing to do with preferences. The idea of the UUID is that it
remains invariant under mutation, so a hash just doesn't cut it, you
need some non-deterministic seed (the things you listed aren't enough to
ensure that, BTW, as they can be faked).

> but this would involve
> additional logic and communication to determine whether two objects
> with the same SHAs but different disambiguators reference the same
> object or not (you need to do a diff), and some algorithm for choosing
> one to give precedence to improve the chance that a "canonical" ID
> will propagate.  Linus decided to bet that SHA1s won't collide in his
> lifetime.

This is not the point. The objects/blobs/trees in git are immutable, so
a hash is exactly the right thing to use to reference them. The idea of
using UUIDs for file objects is to have a reference to a /mutable/
object. This is more like what an inode of a file is in Unix, but in
contrast to an inode, the UUID is not limited to a single file system
and has an endless lifetime. And just as inodes in a Unix file system
decouple the name (path) of the object (file) from the act of reading or
modifying it, so UUIDs are supposed to decouple hunks (changing the file
content) from changes in the tree that rename the file, or remove it or
whatever. (The analogy is fitting, as in Unix I can continue working
with a file or even a whole directory long after it was deleted by some
other process; deletion just removes the reference to the inode from the
directory.)

>  > This doesn't sound too bad, as a rough idea, but there are many, many
>  > details left to fill in.
> 
> As far as I can see, it's so analogous to the way git does things that
> a straightforward implementation will be, well, straightforward.  For
> serious use, you'll probably want to optimize, and that's always tricky.

True.

>  > > git has one, per directory in the working tree, *per commit*.
>  > 
>  > My first thought was that must cost a huge amount of disk space but of
>  > course that's not true since all identical objects are shared in the
>  > database, right?
> 
> Exactly.  Some people refer to git as a "filesystem" with SHA1s as
> inodes (and of course that's why the consistency check command is
> called "fsck").

In a Unix file system, the inode represents file identity. It does not
change when the file is mutated. This must be different in git, then,
since a hash can only refer to a specific version of the file. Does each
blob object contain a reference to its previous version(s), or is
tracking identity of files done only at the commit level?

>  > Our future (hopefully) UUID based patch theory could have an advantage
>  > here.
> 
> I don't see why patch theory itself would be UUID-based?

Perhaps a confusing terminology. Like most theories, patch theory has
multiple layers of abstraction. Whether the upper or the lower layers
are meant depends on context.

When you reason about things like how to represent conflicting changes
or how to merge them, you normally abstract over the "foundation", i.e.
the lower layers. You just use the "API", in other words, a set of
operations (like invert, commute, apply, etc) and /assume/ they observe
certain laws. (This is like reasoning about vector spaces where you
/assume/ some underlying field observing the usual field axioms.) You
then go and prove that whatever you are adding at the higher level
preserves these laws (and observes some additional ones). (In reality,
patch theory proofs are too hard, so we do the next best thing: we
formulate the properties as functions and test them with QuickCheck...)
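
To give a flavor of what such a property looks like, here is a
deliberately silly toy example (a "patch" merely sets one key in a map,
and two patches commute exactly when they touch different keys; this has
nothing to do with the real Darcs types):

  -- Toy "patch theory" to illustrate how Darcs-style laws are
  -- phrased as QuickCheck properties.
  import qualified Data.Map as M
  import Test.QuickCheck

  data Patch = Set Int String   -- set key to value
    deriving (Eq, Show)

  instance Arbitrary Patch where
    arbitrary = Set <$> choose (0, 5) <*> elements ["a", "b", "c"]

  apply :: Patch -> M.Map Int String -> M.Map Int String
  apply (Set k v) = M.insert k v

  -- Commutation: swap the order of two patches, or fail (Nothing)
  -- if they conflict (here: they touch the same key).
  commute :: (Patch, Patch) -> Maybe (Patch, Patch)
  commute (Set k v, Set k' v')
    | k == k'   = Nothing
    | otherwise = Just (Set k' v', Set k v)

  -- Law 1: commuting twice gives back the original pair.
  prop_roundTrip :: Patch -> Patch -> Property
  prop_roundTrip p q =
    case commute (p, q) of
      Nothing       -> property True
      Just (q', p') -> commute (q', p') === Just (p, q)

  -- Law 2: the commuted pair has the same effect on every state.
  prop_sameEffect :: Patch -> Patch -> [(Int, String)] -> Property
  prop_sameEffect p q kvs =
    let s = M.fromList kvs
    in case commute (p, q) of
         Nothing       -> property True
         Just (q', p') -> apply p' (apply q' s) === apply q (apply p s)

  main :: IO ()
  main = do
    quickCheck prop_roundTrip
    quickCheck prop_sameEffect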

The UUID stuff is concerned with the foundation, that is, with what we
call "primitive" patches. We currently have hunks, add/remove file or
directory, replace, and setpref. These form just one possible
implementation of what the higher level theory assumes. The UUID based
primitive patches are an alternative foundation that allows commutation
to succeed more often.

I guess their main disadvantage is that it is harder to design a good UI
for them: you don't want users to have to use UUIDs to refer to files,
if possible. We can use the "current location" of a file/dir object but
that requires the object to be "manifest", meaning it actually has such
a location. But there are now also objects that are not manifest, so
there is no way to refer to them by name or path. I used to call them
"ghost files" and I guess this is how I will display them to the user
i.e. "ghost file 34f3e84ab765748" or something like that.

>  > > Described that way, sounds like it would make me nervous, too. :-)
>  > > On the other hand, in practice I generally have refs for things I
>  > > refer to,
> 
> The "nervous people" I'm talking about are frequently not considered
> exactly human and rarely are developers, and are worrying about content
> that they make the developers find and identify refs for: lawyers. :-
:-)))

Cheers
Ben


