[darcs-users] optimising darcs annotate

David Roundy droundy at darcs.net
Fri Oct 24 13:51:09 UTC 2008


On Fri, Oct 24, 2008 at 1:15 AM, Ganesh Sittampalam <ganesh at earth.li> wrote:
>> I'll repeat what I mentioned above:  it's faster and better to refer
>> to patches by hash than by number.  It takes a bit more space, but
>> that shouldn't be a significant downside, and the upside is that you
>> have easy O(1) lookup of patch name and contents, and potentially a
>> human-readable database (assuming the humans don't mind grubbing
>> around in _darcs/patches/).
>
> The hashes are far bigger than a 32-bit int, so it'll be a lot more space
> (proportionately), which I think would be enough to matter on large repos
> with large patches. I would also expect it to be significantly slower to
> read the larger file, which I would expect to outweigh the time cost of
> looking up the numbers later.

It's still a small fraction of the space, and the key to reasonable
optimization is to reduce the amount of information that you need.

> I'm not sure how patch hash gives O(1) lookup of contents - filesystems are
> generally O(n) or O(log n) in each directory, aren't they? I'd expect an
> IntMap lookup to also be very fast.

That's a good point.  But then again, you have to read the patch file
in either case, so using the hash has O(1) overhead, whereas looking
up an index has O(log N) overhead (assuming you use an in-memory data
structure listing all the patch IDs), plus O(N) overhead (but not
per-patch) to read the file (assuming you keep your index mapping in a
single file).  Since we're trying to remove the O(N) cost, that
doesn't seem like such a hot idea.
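
To illustrate the difference, here's a minimal Haskell sketch (the
names are invented, and the _darcs/patches layout is only assumed for
illustration; this isn't actual darcs code):

    import qualified Data.IntMap as IM
    import System.FilePath ((</>))

    type PatchHash = String

    -- Hash-based reference: the hash is itself the file name, so no
    -- auxiliary lookup structure is needed at all.
    patchPathByHash :: PatchHash -> FilePath
    patchPathByHash h = "_darcs" </> "patches" </> h

    -- Index-based reference: a small integer must first be resolved
    -- to a hash via an in-memory table, which costs O(log N) per
    -- lookup and O(N) to build from the single index file.
    type PatchIndex = IM.IntMap PatchHash

    patchPathByIndex :: PatchIndex -> Int -> Maybe FilePath
    patchPathByIndex idx i = fmap patchPathByHash (IM.lookup i idx)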

Another important consideration is that it'd be best to avoid making
record (or amend-record, rollback, obliterate, etc.) O(N) in the number
of patches in the repository.  It currently is O(N) on hashed
repositories (but not old-fashioned ones), and is therefore very slow.
That's something I'd like to fix, so it'd be nice to avoid adding more
data that keeps it O(N) (e.g. a single file that references every
patch in the repository).
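
To sketch why (purely illustrative, not how darcs actually lays out
its files; a pure hash function from the SHA package is assumed here
just to stand in for content hashing):

    import qualified Data.ByteString.Lazy.Char8 as BL
    import Data.Digest.Pure.SHA (sha256, showDigest)
    import System.FilePath ((</>))

    -- A file that references every patch and is named by the hash of
    -- its contents must be re-read, re-hashed, and rewritten in full
    -- just to add one entry, so record becomes O(N).
    appendToHashedIndex :: FilePath -> FilePath -> String -> IO FilePath
    appendToHashedIndex dir oldName entry = do
      old <- BL.readFile (dir </> oldName)
      let new     = BL.append old (BL.pack (entry ++ "\n"))
          newName = showDigest (sha256 new)
      BL.writeFile (dir </> newName) new
      return newName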

> I was planning on putting all the code to maintain the mapping in one place,
> thus reducing (though not eliminating) the headache.
>
> In view of your objections I'll either start with hashes and see how that
> behaves, or make it easy to switch the code so we can test out both ways.

Sounds good to me.  Unfortunately, I haven't figured out a nice layout
that avoids being O(N).  One option that would at least keep record
cheap, of course, would be to turn everything on its side: instead of
storing a list of patches that affect each file (or equivalence
class), store for each patch a list of the files it affects.  This
would mean that annotate would still be O(N) in the total number of
patches, but we could avoid parsing any patches that don't affect the
file or directory we're annotating, so it's likely to be a huge
improvement in any case.  And it would mean that we could reuse the
same broken-up structure we already use for inventories, so keeping
record from being O(N) would involve no extra work.
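
Roughly what I have in mind, as a small sketch (the names are made up
for illustration):

    type PatchHash = String

    -- The transposed layout: for each patch, the files and
    -- directories it touches.
    data PatchFiles = PatchFiles
      { pfHash  :: PatchHash
      , pfFiles :: [FilePath]
      }

    -- Annotate still scans the whole list, but only the patches whose
    -- entry mentions the file need to be fetched and parsed.
    patchesTouching :: FilePath -> [PatchFiles] -> [PatchHash]
    patchesTouching f = map pfHash . filter (elem f . pfFiles)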

>>> - Maintain equivalence classes of (a) filenames and (b) directory
>>> names.  I still haven't quite worked out the details of this, but
>>> the basic idea is that two names are related if there's ever been a
>>> move operation from one to the other in the repo history (again,
>>> supersets are allowed), and then we take the
>>> reflexive-symmetric-transitive closure to make the equivalence
>>> relation. We need to track directories and files separately because
>>> a directory move can implicitly cause file moves, and commutation
>>> means that we can't be sure what the complete set of affected files
>>> is at the point of the move.
>>
>> A far simpler approach would be to only optimize for currently-existing
>> files and directories, which is the common use case.  Then we have a very
>> simple structure that's easy to update and use, and we can fall back to the
>> existing code for removed files or directories.
>
> I did think about that, but I had the impression from zooko's email (in
> response to my previous questions about darcs annotate earlier this week)
> that going back in the history is a significant use case.

It depends on what you mean by "going back in the history".  Of
course, the whole point of annotate is to go back in history, but the
question is how common it is to go back and look at deleted files.

> Also, just optimizing for the current would basically mean starting again if
> we did ever want to optimize for the history too.

To be clear, I meant optimizing for the currently-existing files.  In
almost any project those will be the majority of the files (and
usually the most interesting ones).  But your point is fair that
optimizing for deleted files would be a separate job.

> The downside of this approach is that if there are lots of operations where
> a file is renamed and then a new file with the old name created, things will
> go slower.

Right.

>> Right.  Okay, I think I'm beginning to understand now.  These
>> equivalence classes make your database more awkward to implement and
>> slower to use, but allow us to reuse more of the crufty old annotate
>> code... and to have a single code path, which is also nice.
>
> I don't mind rewriting the annotate code if necessary, but I think one code
> path is important for maintainability. I did think about just having a
> single fast path, but concluded I'd have to copy/rewrite quite a bit of
> existing code to do so.

Yes, I think you're right.

>>> - I'm planning on using Data.Binary for persisting the structures
>>> described above. Probably with a version number at the beginning for any
>>> backwards compatibility issues.
>>
>> I'd definitely prefer not to do this.  It stinks of premature
>> optimization.  Why not just use the same sorts of data structures we're
>> already using?  All you really need to do to improve annotate is to
>> change its scaling, and we've already got all the data structures in
>> place that you need, so far as I can tell.
>
> Firstly, profiling annotate shows that it spends a large amount of its time
> reading and gunzipping text
> (http://urchin.earth.li/~ganesh/temp/darcs.prof). This won't be anywhere
> near as significant for reading this database, as it's just one file, but it
> does suggest that text formats come with a huge overhead.

That's because it spends all its time reading patch files, which is
what you're optimizing.

> Secondly, Data.Binary is actually easier to use for maps and sets, which are
> the in-memory structures I plan to use, since instances for them already
> exist in the binary package, so I wouldn't really call this premature
> optimization; I don't particularly want to design a text file format and
> write and debug a parser for it.
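
It's true that the existing instances make the persistence side a
one-liner in each direction; something like this round trip (with a
placeholder file name and contents) is all it takes:

    import Data.Binary (decodeFile, encodeFile)
    import qualified Data.Map as M

    roundTrip :: IO ()
    roundTrip = do
      let idx = M.fromList [("Foo.hs", [1, 2, 5 :: Int])]
      encodeFile "annotate.idx" idx
      idx' <- decodeFile "annotate.idx"
      print (idx == (idx' :: M.Map FilePath [Int]))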

But I think that using sets or maps in-memory is going to be a
horrible pessimization, if your sets or maps are going to refer to all
the patches in the repository.  It'll mean that there's no way to use
your code and keep record from being O(N).

>> I would far prefer a text format akin to the hashed repository format.
>> It's easy to atomically update, and it's easy to work with.  And it
>> would enable features like running darcs annotate on a remote
>> repository--which could be extremely stupid, but could also be
>> convenient--or lazily grabbing this database when doing a darcs get.
>> This latter possibility would be very nice, as it would mean you could
>> run annotate on a huge repository without downloading all the
>> patches.
>
> Why do binary files preclude any of this?

I presumed that you were going to store large binary files, rather
than many small ones referenced by hash.

> The one thing that I do agree would be more convenient is human
> inspection/hacking of the file, as you say above.
>
> Anyway, I don't anticipate it being a big problem to change this before the
> code goes into darcs, apart from the actual work of writing the text file
> format and parser, so I'll start with the easy solution for me and we can
> take it from there.

We already have the code to write the text formats and parsers, unless
you're planning on introducing new ones.  I like the idea of reusing
code.

On thinking this over, it seems likely that the best plan would be to
use the transposed approach, where you associate each patch with the
file and directory names it affects.  This keeps patch creation cheap,
and although it means we still need to read and scan through this
entire list (back in history to the time the file we're interested in
was created), it should still gain us three or so orders of magnitude
in efficiency, since we're greatly reducing the number of patches we
need to read.  It also has the advantage of letting us easily achieve
your equivalence-class feature in terms of the number of files read,
and of keeping a single code path through annotate.  And since any
annotate that relies on a lookup table for patch indices will also be
O(N), we haven't lost anything in scaling.
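
Concretely, the scan I'm picturing looks something like this (a
self-contained sketch with invented names; the "creates" marker in
particular is hypothetical):

    import qualified Data.Set as S

    type PatchHash = String

    data IndexEntry = IndexEntry
      { ieHash    :: PatchHash
      , ieTouches :: [FilePath]  -- files/directories the patch affects
      , ieCreates :: [FilePath]  -- files the patch adds
      }

    -- Walk the per-patch index from newest to oldest, keep the
    -- patches touching any name in the file's equivalence class, and
    -- stop at the patch that created it; every other patch is skipped
    -- without ever being read.
    relevantPatches :: S.Set FilePath -> [IndexEntry] -> [PatchHash]
    relevantPatches cls = go
      where
        go []     = []
        go (e:es)
          | any (`S.member` cls) (ieCreates e) = [ieHash e]
          | any (`S.member` cls) (ieTouches e) = ieHash e : go es
          | otherwise                          = go es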

David

