[darcs-users] optimising darcs annotate

Ganesh Sittampalam ganesh at earth.li
Fri Oct 24 05:15:50 UTC 2008


On Thu, 23 Oct 2008, David Roundy wrote:

> On Thu, Oct 23, 2008 at 10:29:37PM +0100, Ganesh Sittampalam wrote:
>> Hi,
>>
>> I've been thinking in detail about how to optimise darcs annotate, which
>> I hope to do at the sprint this weekend. I'd appreciate comments on my
>> current plan, which is not yet fully fleshed out.
>>
>> - The basic idea is to keep track of what patches might touch what files,
>> as has been suggested in past discussions about this topic.
>>
>> - In each repo, maintain a local numbering of all the patches in the
>> repo. This will facilitate referring to the patches efficiently. If a
>> patch is deleted or commuted, there is no need to remove it from the
>> numbering, so this might end up as a superset of the patches currently
>> present.
>
> Why not just use the hashes that we already store (i.e. in a hashed
> repository)? I don't see what an additional layer of indirection will
> gain us except a headache trying to keep everything synchronized.

> I'll repeat what I mentioned above:  it's faster and better to refer
> to patches by hash than by number.  It takes a bit more space, but
> that shouldn't be a significant downside, and the upside is that you
> have easy O(1) lookup of patch name and contents, and potentially a
> human-readable database (assuming the humans don't mind grubbing
> around in _darcs/patches/).

The hashes are far bigger than a 32-bit int, so it'll be a lot more space 
(proportionately), which I think would be enough to matter on large repos 
with large patches. I'd also expect reading the larger file to be 
significantly slower, by enough to outweigh the time cost of looking up 
the numbers later.

I'm not sure how a patch hash gives O(1) lookup of contents - filesystems 
are generally O(n) or O(log n) in each directory, aren't they? I'd expect 
an IntMap lookup to also be very fast.

I was planning on putting all the code to maintain the mapping in one 
place, thus reducing (though not eliminating) the headache.

In view of your objections I'll either start with hashes and see how that 
behaves, or make it easy to switch the code so we can test out both ways.
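
For concreteness, the sort of structure I have in mind for the numbering 
is roughly this (a sketch only; all the names are illustrative, nothing 
is final):

    import qualified Data.IntMap as IM
    import qualified Data.Map as M

    type PatchHash = String  -- the hash we already store, e.g. in _darcs/patches/

    -- Bidirectional mapping between the local numbering and the hashes.
    -- Numbers are never reused, so the map may describe a superset of
    -- the patches currently in the repo.
    data PatchIndex = PatchIndex
      { nextNum   :: Int
      , numToHash :: IM.IntMap PatchHash
      , hashToNum :: M.Map PatchHash Int
      }

    -- Assign a number to a patch if it doesn't already have one.
    addPatch :: PatchHash -> PatchIndex -> (Int, PatchIndex)
    addPatch h idx@(PatchIndex n n2h h2n) =
      case M.lookup h h2n of
        Just i  -> (i, idx)
        Nothing -> (n, PatchIndex (n + 1)
                                  (IM.insert n h n2h)
                                  (M.insert h n h2n))

Switching to hashes throughout would just mean dropping the IntMap side 
of this and keying everything on PatchHash directly.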

>> - Maintain equivalence classes of (a) filenames and (b) directory
>> names.  I still haven't quite worked out the details of this, but
>> the basic idea is that two names are related if there's ever been a
>> move operation from one to the other in the repo history (again,
>> supersets are allowed), and then we take the
>> reflexive-symmetric-transitive closure to make the equivalence
>> relation. We need to track directories and files separately because
>> a directory move can implicitly cause file moves, and commutation
>> means that we can't be sure what the complete set of affected files
>> is at the point of the move.
>
> A far simpler approach would be to only optimize for currently-existing 
> files and directories, which is the common use case.  Then we have a 
> very simple structure that's easy to update and use, and we can fall 
> back to the existing code for removed files or directories.

I did think about that, but I had the impression from zooko's email (in 
response to my previous questions about darcs annotate earlier this week) 
that going back in the history is a significant use case.

Also, just optimizing for the current state would basically mean starting 
again if we ever did want to optimize for the history too.

The downside of this approach is that if there are lots of operations 
where a file is renamed and then a new file is created with the old name, 
things will go slower.
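
To make the closure concrete: what I have in mind is essentially a 
union-find over names, one structure for files and one for directories. 
A naive sketch (no path compression; names illustrative):

    import qualified Data.Map as M

    type Name = String

    -- Each name maps to a parent; a name absent from the map is its
    -- own representative.
    type Classes = M.Map Name Name

    find :: Classes -> Name -> Name
    find cs n = maybe n (find cs) (M.lookup n cs)

    -- Record that a move from 'a' to 'b' has been seen somewhere in
    -- the history: merge their classes.
    recordMove :: Name -> Name -> Classes -> Classes
    recordMove a b cs
      | ra == rb  = cs
      | otherwise = M.insert ra rb cs
      where ra = find cs a
            rb = find cs b

The representative of a class would then be the key for the patch-set 
mapping I describe below.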

> Alternately, you could maintain a database using a repository-specific
> unique ID for each file, which would just be its original filename and
> the ID of the patch that created it, and then we'd just need a mapping
> from current filenames to file IDs and the patch database could be
> keyed on the unique ID, so that for any file you'd be able to look up
> precisely which patches affected that file.

Yes, that sounds good if we don't care about going back in the history.
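
For reference, my understanding of that suggestion as a sketch (the types 
are mine, not a settled design):

    import qualified Data.Map as M

    type PatchHash = String

    -- A repository-unique file ID: the original name plus the patch
    -- that created the file.
    data FileId = FileId
      { originalName  :: FilePath
      , creatingPatch :: PatchHash
      } deriving (Eq, Ord, Show)

    -- current filename -> ID, and ID -> patches touching that file
    type CurrentNames  = M.Map FilePath FileId
    type PatchesByFile = M.Map FileId [PatchHash]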

>> - Maintain a mapping from the equivalence classes to sets of patches
>> (referred to by the numbering). A patch is in the relevant set if it
>> touches any member of the equivalence class.
>
>
>> - Implement some kind of restricted repository/patch reader (perhaps by
>> adding parameters to the existing code) that is constrained to a set of
>> patches and a set of names. This should run much quicker than a full
>> repository read, but be equivalent from the point of view of the annotate
>> code itself if given an appropriate restriction based on the file the
>> user wanted to be annotated. The equivalence classes should ensure that
>> we don't care if the user specifies a patch as well as a file.
>
> Right.  Okay, I think I'm beginning to understand now.  These
> equivalence classes make your database more awkward to implement and
> slower to use, but allow us to reuse more of the crufty old annotate
> code... and to have a single code path, which is also nice.

I don't mind rewriting the annotate code if necessary, but I think one 
code path is important for maintainability. I did think about just having 
a single fast path, but concluded I'd have to copy/rewrite quite a bit of 
existing code to do so.
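
To be concrete about the restriction itself, I'm imagining something like 
the following being threaded through the existing reading code (a sketch 
only; in practice it would probably just be extra parameters):

    import qualified Data.IntSet as IS
    import qualified Data.Set as S

    -- Only patches in this set, and only changes touching these names
    -- (the whole equivalence class), need to be considered.
    data Restriction = Restriction
      { wantedPatches :: IS.IntSet       -- by local patch number
      , wantedNames   :: S.Set FilePath
      }

    relevantPatch :: Restriction -> Int -> Bool
    relevantPatch r i = i `IS.member` wantedPatches r

    relevantName :: Restriction -> FilePath -> Bool
    relevantName r f = f `S.member` wantedNames r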

>> - I'm planning on using Data.Binary for persisting the structures
>> described above. Probably with a version number at the beginning for any
>> backwards compatibility issues.
>
> I'd definitely prefer not to do this.  It stinks of premature 
> optimization.  Why not just use the same sorts of data structures we're 
> already using?  All you really need to do to improve annotate is to 
> change its scaling, and we've already got all the data structures in 
> place that you need, so far as I can tell.

Firstly, profiling annotate shows that it spends a large amount of its 
time reading and gunzipping text 
(http://urchin.earth.li/~ganesh/temp/darcs.prof). This won't be anywhere 
near as significant for reading this database, as it's just one file, but 
it does suggest that text formats come with a huge overhead.

Secondly, Data.Binary is actually easier to use for maps and sets, which 
are the in-memory structures I plan to use, since instances for them 
already exist in the binary package. So I wouldn't really call this 
premature optimization; I don't particularly want to design a text file 
format and write and debug a parser for it.
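
To illustrate: with those instances, the whole persistence layer is only 
a few lines, including the version number (a sketch; the payload type is 
a stand-in for the real structures):

    import Data.Binary
    import Data.Word (Word8)
    import qualified Data.ByteString.Lazy as BL
    import qualified Data.IntMap as IM

    dbVersion :: Word8
    dbVersion = 1

    -- Stand-in for the real database contents.
    newtype AnnotateDB = AnnotateDB (IM.IntMap [FilePath])

    instance Binary AnnotateDB where
      put (AnnotateDB m) = putWord8 dbVersion >> put m
      get = do v <- getWord8
               if v /= dbVersion
                 then fail "unknown annotate database version"
                 else fmap AnnotateDB get

    save :: FilePath -> AnnotateDB -> IO ()
    save f = BL.writeFile f . encode

    load :: FilePath -> IO AnnotateDB
    load f = fmap decode (BL.readFile f)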

> I would far prefer a text format akin to the hashed repository format.
> It's easy to atomically update, and it's easy to work with.  And it
> would enable features like running darcs annotate on a remote
> repository--which could be extremely stupid, but could also be
> convenient--or lazily grabbing this database when doing a darcs get.
> This latter possibility would be very nice, as it would mean you could
> run annotate on a huge repository without downloading all the
> patches.

Why do binary files preclude any of this?

The one thing that I do agree would be more convenient is human 
inspection/hacking of the file, as you say above.

Anyway, I don't anticipate it being a big problem to change this before 
the code goes into darcs, apart from the actual work of writing the text 
file format and parser, so I'll start with the solution that's easiest 
for me and we can take it from there.

Ganesh