[darcs-users] optimising darcs annotate

Thu Oct 23 22:35:11 UTC 2008

On Thu, Oct 23, 2008 at 10:29:37PM +0100, Ganesh Sittampalam wrote:
> Hi,
>
> I've been thinking in detail about how to optimise darcs annotate, which 
> I hope to do at the sprint this weekend. I'd appreciate comments on my  
> current plan, which is not yet fully fleshed out.
>
> - The basic idea is to keep track of what patches might touch what files, 
> as has been suggested in past discussions about this topic.
>
> - In each repo, maintain a local numbering of all the patches in the 
> repo. This will facilitate referring to the patches efficiently. If a 
> patch is deleted or commuted, there is no need to remove it from the 
> numbering, so this might end up as a superset of the patches currently 
> present.

Why not just use the hashes that we already store (i.e. in a hashed
repository)? I don't see what an additional layer of indirection will
gain us except a headache trying to keep everything synchronized.

> - Maintain equivalence classes of (a) filenames and (b) directory
> names.  I still haven't quite worked out the details of this, but
> the basic idea is that two names are related if there's ever been a
> move operation from one to the other in the repo history (again,
> supersets are allowed), and then we take the
> reflexive-symmetric-transitive closure to make the equivalence
> relation. We need to track directories and files separately because
> a directory move can implicitly cause file moves, and commutation
> means that we can't be sure what the complete set of affected files
> is at the point of the move.

A far simpler approach would be to only optimize for
currently-existing files and directories, which is the common use
case.  Then we have a very simple structure that's easy to update and
use, and we can fall back to the existing code for removed files or
directories.

Alternately, you could maintain a database using a repository-specific
unique ID for each file, which would just be its original filename and
the ID of the patch that created it, and then we'd just need a mapping
from current filenames to file IDs and the patch database could be
keyed on the unique ID, so that for any file you'd be able to look up
precisely which patches affected that file.

> - Maintain a mapping from the equivalence classes to sets of patches  
> (referred to by the numbering). A patch is in the relevant set if it  
> touches any member of the equivalence class.

I'll repeat what I mentioned above:  it's faster and better to refer
to patches by hash than by number.  It takes a bit more space, but
that shouldn't be a significant downside, and the upside is that you
have easy O(1) lookup of patch name and contents, and potentially a
human-readable database (assuming the humans don't mind grubbing
around in _darcs/patches/).

> - Implement some kind of restricted repository/patch reader (perhaps by  
> adding parameters to the existing code) that is constrained to a set of  
> patches and a set of names. This should run much quicker than a full  
> repository read, but be equivalent from the point of view of the annotate 
> code itself if given an appropriate restriction based on the file the 
> user wanted to be annotated. The equivalence classes should ensure that 
> we don't care if the user specifies a patch as well as a file.

Right.  Okay, I think I'm beginning to understand now.  These
equivalence classes make your database more awkward to implement and
slower to use, but allow us to reuse more of the crufty old annotate
code... and to have a single code path, which is also nice.

> - I'm planning on using Data.Binary for persisting the structures  
> described above. Probably with a version number at the beginning for any  
> backwards compatibility issues.

I'd definitely prefer not to do this.  It stinks of premature
optimization.  Why not just use the same sorts of data structures
we're already using? All you really need to do to improve annotate is
to change its scaling, and we've already got all the data structures
in place that you need, so far as I can tell.

I would far prefer a text format akin to the hashed repository format.
It's easy to atomically update, and it's easy to work with.  And it
would enable features like running darcs annotate on a remote
repository--which could be extremely stupid, but could also be
convenient--or lazily grabbing this database when doing a darcs get.
This latter possibility would be very nice, as it would mean you could
run annotate on a huge repository without downloading all the
patches.

David