[darcs-users] hashed-storage and darcs-diff

Wed Feb 11 00:00:28 UTC 2009

Hello fellow hackers!

I have done some more work on hashed-storage in the last few days. It now
manages the index "semi-automatically": it keeps the hashed index up to date
with any file changes whenever needed. However, it cannot notice that files
were added to or removed from version control, since implementing that
efficiently would need help from darcs.
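
As a rough illustration of what "semi-automatic" could mean here, the sketch
below keeps a cache of hashes and re-hashes a file only when its size or
modification time has changed. It is not hashed-storage's actual code: the
SHA-256 hash, the (size, mtime) staleness check and the names hash_file,
refresh and demo are all assumptions made for the example. Note that the list
of tracked paths has to come from the caller, which is the part that would
need help from darcs.

```python
import hashlib
import os
import tempfile


def hash_file(path):
    """Return the SHA-256 hex digest of a file's contents (assumed hash)."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def refresh(index, tracked):
    """Bring the index entries for the tracked paths up to date.

    index maps path -> (size, mtime_ns, digest). A file is re-hashed only
    when its size or mtime differs from the cached entry. Returns the list
    of paths that had to be re-hashed.
    """
    rehashed = []
    for path in tracked:
        info = os.stat(path)
        cached = index.get(path)
        if cached is None or cached[:2] != (info.st_size, info.st_mtime_ns):
            index[path] = (info.st_size, info.st_mtime_ns, hash_file(path))
            rehashed.append(path)
    return rehashed


def demo():
    """Hash two files, edit one, and count what each refresh re-hashed."""
    with tempfile.TemporaryDirectory() as root:
        a = os.path.join(root, "a.txt")
        b = os.path.join(root, "b.txt")
        for path, text in ((a, "one"), (b, "two")):
            with open(path, "w") as handle:
                handle.write(text)
        index = {}
        first = refresh(index, [a, b])     # both files are new to the index
        second = refresh(index, [a, b])    # nothing changed on disk
        with open(a, "w") as handle:
            handle.write("one, edited")    # size changes, so a is stale
        third = refresh(index, [a, b])
        return [len(first), len(second), len(third)]
```

Calling demo() shows the intended behaviour: the first refresh hashes both
files, the second hashes nothing, and the third hashes only the edited file.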

I will probably wrap this up in a standalone program and try to distribute it
to a wider community to collect benchmark results. It doesn't make sense to
pursue this if it doesn't make things faster in real-world use. Ideally, I'd
get Simon (see http://bugs.darcs.net/issue1202) to try it out in his setting;
it would be interesting to know whether it helps there.

In other news, I have started adding unit tests, so the repository now has a
few checks in it. The tests also compare against darcs, so we can check that
the code behaves the same as current darcs at the file access level. I don't
expect the code to introduce correctness bugs into darcs -- the implementation
is more a matter of performance than of correctness. The code is fairly simple
and straightforward, with little variation between runs. I'll look into
adding more test cases of course, and probably also some test data, so I can
experiment with working copies and such.

In yet other news, I have also run a bigger benchmark, this time on a
repository with 80k files. Extending my pre-existing 40k repository with 40k
new files, 2000 files per patch, took about half an hour with darcs (probably
on the order of minutes with git). Most of this time was apparently spent,
once again, in code dealing with pristine. Anyway, the current status is that
darcs-diff is about 10 times faster than darcs whatsnew and no more than 70%
slower than git diff. I don't think there's much leeway here, since almost all
the extra time darcs-diff uses (compared to git diff) is, as far as I can
tell, spent in Data.Binary. To cut down this cost, we would have to somehow
use the mmap'd index files directly, instead of building a Haskell data
structure (a list of tuples) out of them with Data.Binary.

I could try to implement a Storable interface instead, which should eliminate
at least part of that cost, although it's not a very high priority just
now. Also, the code is confined to a single module, currently 170 lines long.
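
To make the trade-off concrete, here is a small sketch contrasting the two
ways of reading an index file: decoding everything into a list of tuples, and
mapping the file and unpacking a single fixed-width record in place. It is
written in Python rather than Haskell and is only an analogy for the
Data.Binary versus Storable choice; the record layout (8-byte size, 8-byte
mtime, 32-byte hash) and all function names are assumptions, not the real
index format.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-width record: 8-byte size, 8-byte mtime, 32-byte hash.
RECORD = struct.Struct("<qq32s")


def write_index(path, entries):
    """Write (size, mtime, digest) entries as consecutive fixed-width records."""
    with open(path, "wb") as handle:
        for entry in entries:
            handle.write(RECORD.pack(*entry))


def decode_all(path):
    """Eager approach: decode the whole file into a list of tuples."""
    with open(path, "rb") as handle:
        data = handle.read()
    return [tuple(fields) for fields in RECORD.iter_unpack(data)]


def read_record(path, position):
    """In-place approach: map the file and unpack only one record."""
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return RECORD.unpack_from(view, position * RECORD.size)


def demo():
    """Check that both approaches agree on a small four-record index."""
    entries = [(n * 10, 1234567890 + n, bytes([n]) * 32) for n in range(4)]
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "index")
        write_index(path, entries)
        same_list = decode_all(path) == entries
        same_record = read_record(path, 2) == entries[2]
        return same_list and same_record
```

Because every record has the same size, read_record can jump straight to the
record it wants by offset, while decode_all pays for every record whether or
not it is needed.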

To move further in the direction of directly benefiting darcs, I will take on
the TreeIO monad next, and then probably darcs integration right away. These
two combined should make it possible to significantly speed up whatsnew,
record, revert and diff (through diffing improvements) as well as pull and
apply (through TreeIO). It would also simplify check and repair, since those
currently use an ad-hoc version of what TreeIO is supposed to solve more
elegantly.

This also means I'm deferring work on a better, packed repository format that
would speed up darcs get and remote pull. That work will be much more
intrusive, and is tied to many other areas of darcs, e.g. caching. It will
also require repository format conversion, so we should get it right this
time. The repository format will stay compatible at the patch level, so it's
sort of like the darcs-1 -> hashed conversion. The other kind (darcs-1 ->
darcs-2) will probably come if we migrate to a different patch format.
(Probably camp's, to get rid of exponential commutes? But that's a long time
from now, we'll see where things wander...)

Yours,
   Petr.

-- 
Peter Rockai | me()mornfall!net | prockai()redhat!com
 http://blog.mornfall.net | http://web.mornfall.net

"In My Egotistical Opinion, most people's C programs should be
 indented six feet downward and covered with dirt."
     -- Blair P. Houghton on the subject of C program indentation