[darcs-users] optimising darcs annotate

Benedikt Schmidt Benedikt.Schmidt at inf.ethz.ch
Sun Oct 26 23:36:58 UTC 2008


Ganesh Sittampalam <ganesh at earth.li> writes:

> It turns out that Benedikt Schmidt has some part-done work on this from a 
> year or two ago that he is now going to port to unstable and finish off, 
> so I'm going to defer to him.

I have made my repository with the current version of the code
available at http://shim.haskellco.de/darcs-filecache/.
For now it only supports creating the filecache from scratch
and using the filecache to speed up 'darcs changes file'.
For testing, e.g. in the darcs repo, you can create a filecache
with 'darcs optimize --filecache'. Here are some measurements
I just did in my version of the unstable repo.

$ time ./darcs optimize --filecache
real	0m37.514s
user	0m11.154s
sys	0m5.446s
$ du -hs _darcs
43M	_darcs/
$ du -hs _darcs/filecache
4.6M	_darcs/filecache/

|                                       | hot cache | cold cache |
| darcs changes GNUmakefile             | 3.717s    | 12.113s    |
| darcs changes --filecache GNUmakefile | 0.842s    | 3.568s     |
  
The current implementation of changes only filters the relevant patches
from the complete set of patches in the repo, but the code still reads
every patch that touches the file. Since the patchinfo is in the
inventory and the information which patch touches which file, changes
FILE should get by without reading any patches.

The next step is querying the filecache for other commands like annotate
and updating the filecache on record, pull, obliterate, and so on.

In case someone is interested in the data- and on-disk structures:

Optimize --filecache adds a filecache with a filecache-inventory file
that contains the patchids (pinfo hash) of all patches in the inventory
in the same order as the inventory.  The mapping from file names to
patches that modify these file is stored in the filemap directory. For
every filename that has ever existed in the repo, there is a
modification file named hash(relative_path_file) with an ordered list of
modifications (newest last): * Touch patchid * Removed patchid *
RenamedTo newfilehash index * RenamedFrom oldfilehash index Touch and
Removed should be self explanatory, RenamedTo and RenameFrom are forward
and backward pointers into other modification files when a file has been
renamed.  For example, to find all patches that touch the file FOO, we
open _darcs/filecache/filemap/<hash(FOO)> and add all Touch entries
starting from the end until we arrive at an entry of the other three
types. If it is an Removed or RenamedTo entry, we stop since the file
was created by the last Touch entry and the entries before the Remove or
RenamedTo refers to an earlier file with the same name. If there is an
RenamedFrom oldfilehash index entry, we knows that the file has been
renamed from oldfile to FOO and we can read about the patches that
touched oldfile in oldfilehash starting from index (reading backwards
again).

For keeping the filehash up-to-date we need the following
operations:
 * append a modification entry to one of modification files
 * remove all entries that refer to a certain patch from
   the modification files

Then we can handle arbitrary changes without recreating the filecache
from scratch. After modifying the repo from
oldstate =compref++oldpatches to newstate = compref++newpatches,
we remove all patches in oldstate and append all entries for patches
in newpatches.

For removing the patches, it would be nice to also have a mapping from a
patch to the set of files it modifies (or even an index into the file)
so that scanning all modification files for the patch that has to be
remove is not necessary.

  Benedikt



More information about the darcs-users mailing list