[darcs-users] ANN: hashed-storage 0.1

Petr Rockai me at mornfall.net
Mon Jan 26 21:50:56 UTC 2009


Petr Rockai <me at mornfall.net> writes:
Hi once more!

So I have worked a little on hashed-storage again. Of course, I have let myself
be tempted to do something completely different than I previously intended. So
the result is an implementation of unified diffs on top of Igloo's lcs package
and the hashed storage. The gritty details of the implementation live in
Storage.Hashed.Diff for now, the driver (included with hashed-storage as
darcs-diff) looks like this:

    main = do
      pristine <- readDarcsPristine "." >>= unfoldAll
      working <- readPlainTree "." >>= unfoldRestricted pristine
      BL.unpack `fmap` diffTrees pristine working >>= putStr

Of course, it is completely oblivious of pending and lacks a fair amount of
features (like maybe --look-for-adds). However, the good news is that it
outperforms darcs by a relatively wide margin:

-- divine-1.3 (about 2300 files totalling around 40M):

   darcs-diff                .95s
   darcs wh --ignore-times  5.94s

-- ghc-hashed

   darcs-diff                .4s
   darcs wh --ignore-times  3.1s

-- darcs mainline

   darcs-diff               .12s
   darcs wh --ignore-times 1.14s

So I thought, let's pit the haskell implementation of a plain old diff against
GNU diff. This is the plain-diff implementation:

    main = do
      [a', b'] <- getArgs
      a <- readPlainTree a' >>= unfoldAll
      b <- readPlainTree b' >>= unfoldAll
      BL.unpack `fmap` diffTrees a b >>= putStr

I ran these:

    $ darcs diff --diff-command="time -p diff -ru %1 %2"
    $ darcs diff --diff-command="time -p plain-diff %1 %2"

and the results:

-- divine-1.3 (about 2300 files totalling around 40M):

   plain-diff .47s
   GNU diff   .24s

-- ghc-hashed

   plain-diff .18s
   GNU diff   .11s

-- darcs mainline

   plain-diff .06s
   GNU diff   .03s

I think these times are entirely reasonable. Of course, this is a completely
unscientific and somewhat synthetic benchmark. GNU diff would probably tear us
completely apart on heavily modified trees. I had just a few modifications here
and there, if any at all. Still, this is an important use-case. The IO
efficiency matters here the most, not the diff implementation itself. I'm also
wondering why darcs wh takes such huge amounts of time... The difference
between plain-diff and darcs-diff (about 2-fold) is reading and decompression
of the hashed pristine tree. I hope to reduce that further in the no-diff case
even without mtime checks.

Of course, darcs wh is usually much better off, thanks to mtime-based
optimisation. However, I intend to implement that soon-ish, so we'll see how
the mtime-optimised versions bode against each other.

That should be it for today. I'll report back when I have some new results.

Yours,
   Petr.

--
Peter Rockai | me()mornfall!net | prockai()redhat!com
 http://blog.mornfall.net | http://web.mornfall.net

"In My Egotistical Opinion, most people's C programs should be
 indented six feet downward and covered with dirt."
     -- Blair P. Houghton on the subject of C program indentation


More information about the darcs-users mailing list