[darcs-users] Historical Versions (was: GSoC: network optimisation vs cache vs library?)

Max Battcher me at worldmaker.net
Thu Apr 15 05:26:42 UTC 2010

On 4/14/2010 22:49, Isaac Dupree wrote:
> On 04/14/10 20:18, Max Battcher wrote:
>> All of which goes to show that Trac+darcs still isn't well optimized for
>> caching darcs queries or dealing gracefully with long-running
>> command invocations... I still say the Trac reliance on CVS/SVN-style
>> revision numbers means that Trac is absolutely not well-adapted for
>> serving darcs repositories. It may be "revision 1782" to Trac, but 'show
>> contents --match "hash 2008..."' is "commute this file to how it would
>> appear if only the patches preceding or equal to this one with a
>> timestamp from two years ago were applied" to darcs. (Which ends up
>> being quite possibly not a "real" historic version at all,
> Well, suppose you have a public darcs repository for a project. (Such as
> GHC HEAD.) If you look at the history of the real world (as opposed to
> darcs' conception of history), this repo contained a series of states
> over time. What infrastructure would we need, to be able to look at this
> series usefully/efficiently years later? (I am reckoning that this
> concept of history is useful enough that it's worth creating whether or
> not darcs itself can support it. Does anyone agree/disagree?)

I've put an odd amount of thought into that over the years, and I've 
also wondered how important it might be in reality... Different 
developers will probably disagree on which bits are important, and I 
think some of those philosophical differences are precisely the same 
reasons why git and darcs (for instance) can co-exist because developers 
may continue to prefer one approach over the other...

First of all, darcs does have one concept of real world history that 
already is critical in many areas to darcs performance: the TAG. If 
there is an important point in a repository's history, it should be 
named and tagged. I can see a distinct case for specifically making sure 
that any/all operations --to-tag/--from-tag are as performant as 
possible. I could also see a case for some sort of (possibly opt-in) 
auto-caching system for tag states (pristines).

Beyond that, darcs itself doesn't have any knowledge of "real world 
history"... It doesn't track which patch was pulled/pushed in, only when 
the patch was originally written (according to the clock of the system 
on which the patch was written). This makes sense to darcs due to the 
"fluidity" of patch movement (thanks to cherry-picking) and potential 
complexity. (Should darcs try to record the integration history of a 
patch across every branch/repository that patch has ever seen/will ever 
see? How do you merge "conflicting" integration histories? Who controls 
it? How do you keep it secure?) Darcs admittedly takes the easiest 
possible approach, which is: don't worry about it.

Is that the correct approach? Maybe. Assuming valid timestamps all 
around and adequate tagging, darcs' commutation-based conception of 
history is a close enough approximation to real history to help a 
patient human find what they are looking for. (Certainly not a close 
enough solution to make "every version" available via direct HTTP GET 
requests to darcs commands, but on the order of a file system search for 
a human performing a query, for example.)

Assuming that you do critically need/want more historic version 
information cached/saved... Here's something of the possibility spectrum:

* The "pig-in-a-blanket" repository: store a darcs repository inside a 
git (or svn, or whatever) repository. It sounds silly, but it's not all 
that different than using some of the "patch queue" tools that 
git/hg/svn users already use... you're just using darcs as a more 
powerful patch queue and git (or whatever) as the fastest, dumb "store 
the state of lots of files at each moment in time that I designate" file 
store that you can find and trust. (Slightly less crazy variations might 
be to make direct use of a distributed block store like S3, HDFS, or 
even a document database...)

* Context-generating pre/posthook: before/after history manipulating 
commands (apply, pull, record, amend-record, ...) something like:

   darcs changes --context > archive.`date +%s`.context

   That's the basics you would need to keep track of actual, real-world 
historical states. Although, you'd probably want to compress the context 
files together for more long term storage, or find some more capable 
storage engine. From the generated context files you should be able to 
recreate all of the actual historical states. (Unfortunately it may not 
be as performant or capable as it should be, because context files need 
a bit of love...)

* In-repo branching: There's a long thread on the subject, but the 
basics are that the hashed-storage backend could easily store more than 
one inventory/pristine state in the same repository. Theoretically you 
could build a third-party tool to handle multiple "root pointers" and 
then "hold onto" root pointers for historic versions so that those files 
don't get garbage collected. (This is sort of an inversion of the 
"pig-in-a-blanket" idea: use darcs' own current data storage backend 
(hashed-storage), but encourage it/tune it to store more than darcs 
alone does.)

* Propose a useful interaction pattern for darcs optionally to track 
such things itself and help it get implemented. Certainly, the toughest 
path, but it may be possible for someone to come up with a good plan of 
attack that darcs could implement directly.
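
To make the context-generating hook idea a little more concrete, here is a minimal sketch of a helper that a pre/posthook could call. The names save_context and ARCHIVE_DIR and the default archive location are assumptions made for illustration, not anything darcs provides; the only darcs command used is the `darcs changes --context` invocation from the bullet above.

```shell
#!/bin/sh
# Hypothetical helper: save the repository's current context under a
# timestamped file name. ARCHIVE_DIR and save_context are illustrative names.

# Assumed default location, kept outside the working tree so the archive
# files do not show up as unrecorded files.
ARCHIVE_DIR="${ARCHIVE_DIR:-$HOME/darcs-contexts}"

save_context() {
    mkdir -p "$ARCHIVE_DIR" || return 1
    # Seconds since the epoch; two runs within the same second would
    # overwrite each other, which this sketch does not try to handle.
    stamp=$(date +%s)
    out="$ARCHIVE_DIR/archive.$stamp.context"
    tmp="$out.tmp.$$"
    # Write to a temporary file first so that a failed darcs run does not
    # leave a partial context file in the archive.
    if darcs changes --context > "$tmp"; then
        mv "$tmp" "$out" || return 1
        printf '%s\n' "$out"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

One way to wire this up, assuming the file is saved at a path of your choosing and that the hook runs inside the repository, would be something like `darcs pull --posthook '. /path/to/save-context.sh && save_context'`. A saved file could later be handed to `darcs get --context=FILE` to rebuild that state, subject to the caveat above about context files needing a bit of love.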

That's how you might go about doing it... I personally don't see a need 
for it. I think there are more interesting tools that solve similar 
problems that could be improved first: better/stronger interactive file 
annotation/blame tools; better/stronger darcs trackdown; tools that 
maybe we don't even have names for today. I think it does come down to a 
matter of different lifestyles for different DVCS tools: darcs' "bag" of 
patches != git's DAG of file states.

In most of my development workflow, when I care about historical states, 
I care about 1) tag file states, and 2) individual patch deltas... 
historical integration states in between the two are much less common 
for me to seek out. Both (1) and (2) are easy enough to get from 
darcs... But that's just my approach and I appreciate that other 
developers will disagree on this.
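
For what it's worth, the two cases above might look roughly like the following. This is only a sketch: tag_state and patch_delta are hypothetical wrapper names, and the exact flags should be checked against the installed darcs version.

```shell
# Hypothetical wrappers; run them from inside a darcs repository.

# tag_state TAG DEST: get a fresh copy of the repository as of the tag
# matching TAG, placed in the new directory DEST.
tag_state() {
    darcs get --tag="$1" . "$2"
}

# patch_delta PATTERN: print, as a unified diff, the changes made by the
# patch whose name matches PATTERN.
patch_delta() {
    darcs diff --unified --patch="$1"
}
```

For example, `tag_state 1.0 ../at-1.0` would leave the tagged tree in a sibling directory, and `patch_delta 'fix typo'` would show a single patch's delta (both arguments are made up for illustration).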

Hopefully some of the above is useful,

--Max Battcher--
