[darcs-users] two questions about Darcs

Wed Aug 26 08:02:55 UTC 2009

I am a long-time user of Darcs, having begun to use it in either 2003
or 2004, and with very few exceptions I have been thoroughly pleased
by the experience of using Darcs, as well as by the responsiveness of
the Darcs team to my quibbles in #darcs.  Thank you!

Even though I have used Darcs for a long time, I am still merely a
novice user, and, bizarre as this may sound, I should like to stay
that way, because I believe that revision control is too fundamental
in software development to require expertise by the majority of its
users -- and Darcs has supported my belief marvellously.  However,
this may cause my questions and comments to seem naive and to use the
wrong terminology, for which I beg your pardon in advance.

I have two questions, prompted by Bryan O'Sullivan's two remarks about
Darcs in the article <http://queue.acm.org/detail.cfm?id=1595636>.
The first of his remarks was about the performance of Darcs:

`Why isn't everyone using Darcs, then?  For years, it had severe
 performance problems that made it completely impractical.  These have
 been addressed, to the point where it is now merely quite slow.'

I presume that by the `performance problems that made it completely
impractical', O'Sullivan meant Darcs's exponential-time merging
algorithm, a problem that I understand arises far less often in
Darcs 2.  I am not sure what parts of Darcs
O'Sullivan meant to describe as `merely quite slow', but I have always
been frustrated by the performance of `darcs changes <pathname>' and
`darcs annotate <pathname>' in large repositories.

My understanding of src/Darcs/Commands/Changes.lhs suggests that
get_changes_info uses filter_patches_by_names to go through the entire
list of the repository's patches.  Similarly, annotate_file in
src/Darcs/Commands/Annotate.lhs calls getMarkedupFile, whose auxiliary
routine do_mark_all also appears to go through the entire list of the
repository's patches.  This seems highly suboptimal -- for most uses
of the commands, surely they should run in time linearly proportional
to the number of patches that could affect the files the user has
passed to them.

After some discussion in #darcs a while ago (months, perhaps a year or
two), I believe Jason Dagit (lispy) told me that he had implemented
some kind of on-disk cache mapping pathnames to lists of patches that
could affect the files at those pathnames.  I don't remember his
details, but what prompted his mentioning that was my describing a
very conservative cache that would track all renames and identify any
files that ever had the same pathname.  E.g., if I had FOO renamed to
BAR, and then created another file FOO, and renamed BAR to BAZ, then
both these files would conservatively be assigned the same list of
patches.  Such a conservative cache is safe because its purpose is
only to shorten the list of patches to consider for each file, not to
identify it precisely, and the cache can always be rebuilt if lost.
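
To make the conservative cache concrete, here is a minimal sketch in
Python.  It is not Darcs code and not Jason Dagit's implementation:
the class name, the ("touch", path) and ("rename", old, new) operation
tuples, and the patch identifiers are all hypothetical, and it ignores
directories entirely.  Pathnames that were ever linked by a rename are
merged into one group, and every pathname in a group gets the same
(possibly too long) list of patches.

```python
from collections import defaultdict


class ConservativePathCache:
    """Maps each pathname to a superset of the patches that could affect it."""

    def __init__(self):
        self._parent = {}                 # union-find forest over pathnames
        self._members = defaultdict(set)  # group root -> indices into _order
        self._order = []                  # patch ids in the order recorded

    def _find(self, path):
        # Locate the representative of the group containing this pathname.
        self._parent.setdefault(path, path)
        root = path
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited node straight at the root.
        while self._parent[path] != root:
            self._parent[path], path = root, self._parent[path]
        return root

    def _union(self, a, b):
        # Merge the groups of two pathnames, combining their patch sets.
        ra, rb = self._find(a), self._find(b)
        if ra != rb:
            self._parent[rb] = ra
            self._members[ra] |= self._members.pop(rb, set())

    def record(self, patch_id, ops):
        """Record one patch; ops are hypothetical ("touch", path) or
        ("rename", old, new) tuples."""
        index = len(self._order)
        self._order.append(patch_id)
        for op in ops:
            if op[0] == "rename":
                _, old, new = op
                # The old name stays in the group, so a later file created
                # at that name conservatively shares the same patch list.
                self._union(old, new)
                self._members[self._find(old)].add(index)
            else:
                _, path = op
                self._members[self._find(path)].add(index)

    def patches_for(self, path):
        """Return the candidate patches for a pathname, in recorded order."""
        if path not in self._parent:
            return []
        return [self._order[i] for i in sorted(self._members[self._find(path)])]


# The FOO/BAR/BAZ scenario described above, plus one unrelated file.
cache = ConservativePathCache()
cache.record("p1", [("touch", "FOO")])
cache.record("p2", [("rename", "FOO", "BAR")])
cache.record("p3", [("touch", "FOO")])          # a new, different FOO
cache.record("p4", [("rename", "BAR", "BAZ")])
cache.record("p5", [("touch", "OTHER")])
```

With this sketch, FOO, BAR and BAZ all map to the same four patches,
while OTHER maps to one.  A command would still have to inspect each
candidate patch, but only the candidates, not the whole history.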

In any case, irrespective of precisely how this cache is constructed,
will any such mechanism ever be included in Darcs to reduce the
frustration of waiting for `darcs changes <pathname>'?

The second remark made by O'Sullivan in his article was:

`Its more fundamental problem is that its theory is tricky to grasp,
 so two developers who are not immersed in Darcs lore can have trouble
 telling whether they have the same changes or not.'

I am emphatically not immersed in Darcs lore, but it has always been
my intuitive impression that if two repositories have no patches to
pull from or push to one another, then they have identical contents.
In #darcs, Simon Michael (sm) answered `yes' when I asked whether this
is true.  I didn't say exactly what I meant by `identical contents',
because, as I said, this is only an intuitive impression.  Obviously I
don't mean that *everything* is the same (e.g., the preferences), but
at least the state of the pristine tree and the collection of patches.

If this is so, then despite what O'Sullivan says, it seems unnecessary
to be immersed in Darcs lore in order to tell whether two repositories
have the same state, if one is reachable from the other: if `darcs
pull' followed by `darcs push' report no patches to transmit in either
direction, then the repositories have identical contents.  However,
this is a heavy-handed test, so I wonder:  Can the state of the
repository be summarized in a concise string such as a hash, or dumped
consistently, say, to a stream of bytes that can be piped to `openssl
dgst', so that if two repositories have identical contents, then they
will show the same summary (hash)?  What if I said `iff' instead of
`if'?
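
To illustrate the kind of summary I have in mind, here is a minimal
sketch in Python.  It assumes, and this is exactly the assumption I am
asking about, that every patch has a stable, unique identifier and
that two repositories with the same set of identifiers have the same
state.  The function name and the string identifiers are hypothetical;
I do not know what Darcs itself would use.

```python
import hashlib


def repo_summary(patch_ids):
    """Hash a collection of patch identifiers, ignoring their order."""
    digest = hashlib.sha256()
    # Sort so that repositories holding the same patches in a different
    # order still produce the same summary.
    for patch_id in sorted(set(patch_ids)):
        encoded = patch_id.encode("utf-8")
        # Length-prefix each identifier so that ("ab", "c") and
        # ("a", "bc") cannot produce the same byte stream.
        digest.update(len(encoded).to_bytes(8, "big"))
        digest.update(encoded)
    return digest.hexdigest()
```

Under those assumptions, equal sets of patches give equal summaries
(the `if' direction), and unequal sets give unequal summaries except
for hash collisions (the `iff' direction).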

Thank you, and sorry for the long message!