[darcs-users] Some progress on hashed-storage.

Petr Rockai me at mornfall.net
Thu Jan 29 17:16:12 UTC 2009


Hi,

I have implemented first part of the hashed index. This is sort of "mtime
optimisation", although it works somewhat differently. The idea is to produce a
binary index for each directory of the working copy, containing hashes, names
and mtimes. A little like git.

Anyhow, I am stuck with any further optimisations due to darcs pristine
format. The fact it's hashed is great, but the directory formatting makes it
hard to hash efficiently. Moreover, to produce right kind of hashes, I would
need to include file sizes, which I currently don't keep (although I probably
should). Anyway, I have produced two synthetic benchmark repos:

many-files: 20 directories, 2000 files each
many-dirs: 20 directories, 20 subdirs each, 100 files per subdir

I have imported the two into both git and darcs. (Per-dir with darcs, and it
was still very painful... It was a lot better with git, although not
super-smooth either.) So the benchmark (darcs-diff has a pregenerated index for
these tests, but it does check its validity, like real VCS tool would):

(Btw., it seems to me that timestamps used by darcs mtime opt go out of sync
all the time. I had to run darcs rev on several occasions to get any reasonable
time from darcs wh. The rev operation also ate over 200M RAM and took over some
3 minutes, although with a cold cache... iew.)

many-files darcs wh: 2.7s
many-files darcs-diff: 1.1s
many-files git diff: .15s

many-dirs darcs: 2.5s
many-dirs darcs-diff: .75
many-dirs git: .2s

In other words, there are still ways to go. Coincidentally, 40k files are
roughly the size of the linux kernel repository. If we can get reasonable
performance (ie. rivalling git) on such trees, I am sure it would make many
people much more comfortable with darcs. I am currently at something like
factor 4 to 7 slower than git (depending on shape of the tree). I believe that
by solving the hashing strategy discrepancies between pristine.hashed and my
own index, I can maybe cut my times in half. (About half the time is currently
spent checking that the index is up to date... In theory, diffing pristine and
the index is very fast, once the index is known to be good.) This will also
probably need incompatible pristine format.

I'll report back again later.

Yours,
   Petr.

--
Peter Rockai | me()mornfall!net | prockai()redhat!com
 http://blog.mornfall.net | http://web.mornfall.net

"In My Egotistical Opinion, most people's C programs should be
 indented six feet downward and covered with dirt."
     -- Blair P. Houghton on the subject of C program indentation


More information about the darcs-users mailing list