[darcs-users] soc final report

Petr Rockai me at mornfall.net
Sun Aug 16 09:03:10 UTC 2009

(A little delayed compared to blog, but alas, I forgot -- Eric reminded me

soc final report

As you may have noticed, I skipped last progress report. That's because I was
busy with coding work. Anyway, let's summarise what happened throughout the
GSoC program. The basic idea was to improve darcs support for large
repositories -- large in terms of number of working files. The primary goal was
to improve the code that diffs the working tree against the pristine cache: an
operation, that is very common, and in darcs 2.2 and earlier was also painfully
slow (depending on circumstances).


So what has been delivered? There is a number of outputs of the project:

- hashed-storage 0.3.7, a library for reading and writing filesystem trees in
  hash-based formats, with primary focus on darcs hashed pristine
- darcs whatsnew integration with hashed-storage 0.3.7, part of the current
  stable release of darcs, 2.3.0
- hashed-storage almost-0.4 -- a work in progress, improving hashed-storage
  0.3.7 in several directions (and breaking the API)
- darcs-hs: a branch integrating hashed-storage 0.4 into darcs, using
  hashed-storage fast diffing mechanism for all working/pristine diffing needs
  -- naturally, also a work in progress
- darcs-benchmark: a standalone package for benchmarking darcs (pitting
  different versions of darcs against each other on a number of medium-sized
  repositories, on a number of different commands)

The first two items on the list are done for good. The diffing code that was a
major goal of the project is stable and part of a stable darcs release. It is
supported by a independent hackage library, hashed-storage.

For efficient retrieval over HTTP... there is support for creating and using
hashed packs in hashed-storage 0.4. This still needs some work to be put in
practice, both on darcs-hs and on hashed-storage. I will continue working on
this in the fall. The current status of code in hashed-storage 0.4 is basically
proof-of-concept (comes with unit tests though). There is also a concept
document, [[designing_storage_for_darcs]]. Either way, I am trying to not rush
things on this, since once we put this into the stable darcs, we will have to
live with it for a long time, and mistakes now would be expensive later.

As for darcs-benchmark -- I have uploaded version 0.1 to hackage and it should
be ready for wider circulation. Please try it out and let me know if you have
any issues with it (`cabal install darcs-benchmark`).

Work in Progress

Since pencils down is approaching quickly, there will stay a number of
in-progress items. There are still some bugs to fix in darcs-hs, and
hashed-storage needs a haddock cleanup and at least one pass over the API
before I can upload 0.4. But let's see what 0.4 brings over 0.3.7. During a
review process that Ganesh started, we have identified a number of weak spots
in hashed-storage. This has directly and indirectly led to hashed-storage 0.4.

On the API level, we have generalised the Tree out of IO into a generic Monad,
to help with testing infrastructure. The Index API has been simplified and,
more importantly, made much safer. With this done, it turned out that the index
reading code is needlessly general, and I have simplified it, improving index
performance along the way (in the process, I also changed the
layout... twice... the magic word versioning mechanism really paid off).

I have also refactored the Hash type, and removed all the hacks that deal with
the size prefix on darcs hashes. This means we won't be able to write out new
trees of this kind anymore (but no big loss, old darcs can read repositories
without the size prefix as well). Along with the original Hash refactor, I have
changed the internal hash representation, cutting hash length from 64 to 32
bytes (this is still SHA256, but is using full 8 bits of every byte, unlike
ascii-hex which carries 4 bits of useful data per byte). This representation is
also used in the index.

The overall result is, that the index size has shrunk considerably -- it is
now, on average, smaller than a git index for equivalent repository, even
though we use stronger hashes (but yes, git index is more than a simple cache).

The even more interesting outcome is that, with darcs-hs, I can now run
"whatsnew" on a 100k-file repository in 0.35 seconds -- a mere .1 second (or
40%) slower than equivalent git diff. With 100% Haskell (well, the sha256
implementation is in C, but this code is not tripped in the optimal case). For
most realistic repository sizes, the difference is going to be negligible. Oh,
and to get an idea how big a 100k-file repository is, the Linux kernel tree has
about 28k files in it.

Future Work

Recently, it has been noted, that for old-fashioned repositories, darcs
whatsnew has regressed in performance by a fair amount. This is true, but it's
not clear if it is worth addressing properly. For 2.3.1, the easy fix is to
restore the old code path when the repository is not hashed. However, I have
already removed unsafeDiff in darcs-hs and I want to keep it that way, so we
either need to start treating old-fashioned as second-class, or either come up
with a more systematic fix.

As for hashed-storage, I want to do further work on packs. Maybe even add code
to read git repositories. Moreover, there is an ongoing review of
hashed-storage which may bring further ideas for changes. We also have to make
some decisions about what to do about darcs in the future -- how to.


It has definitely been a productive summer. Lots of work has been done on
hashed-storage, and it seems that with a little further work, it can provide
solid underpinning to future darcs versions. Apart from major performance
improvements, it exposes a new standalone component with its own set of tests,
improving test coverage of what constitutes darcs.

More information about the darcs-users mailing list