[darcs-users] ANN: hashed-storage 0.1
Petr Rockai
me at mornfall.net
Mon Jan 26 21:50:56 UTC 2009
Petr Rockai <me at mornfall.net> writes:
Hi once more!
So I have worked a little on hashed-storage again. Of course, I have let myself
be tempted to do something completely different than I previously intended. So
the result is an implementation of unified diffs on top of Igloo's lcs package
and the hashed storage. The gritty details of the implementation live in
Storage.Hashed.Diff for now, the driver (included with hashed-storage as
darcs-diff) looks like this:
main = do
pristine <- readDarcsPristine "." >>= unfoldAll
working <- readPlainTree "." >>= unfoldRestricted pristine
BL.unpack `fmap` diffTrees pristine working >>= putStr
Of course, it is completely oblivious of pending and lacks a fair amount of
features (like maybe --look-for-adds). However, the good news is that it
outperforms darcs by a relatively wide margin:
-- divine-1.3 (about 2300 files totalling around 40M):
darcs-diff .95s
darcs wh --ignore-times 5.94s
-- ghc-hashed
darcs-diff .4s
darcs wh --ignore-times 3.1s
-- darcs mainline
darcs-diff .12s
darcs wh --ignore-times 1.14s
So I thought, let's pit the haskell implementation of a plain old diff against
GNU diff. This is the plain-diff implementation:
main = do
[a', b'] <- getArgs
a <- readPlainTree a' >>= unfoldAll
b <- readPlainTree b' >>= unfoldAll
BL.unpack `fmap` diffTrees a b >>= putStr
I ran these:
$ darcs diff --diff-command="time -p diff -ru %1 %2"
$ darcs diff --diff-command="time -p plain-diff %1 %2"
and the results:
-- divine-1.3 (about 2300 files totalling around 40M):
plain-diff .47s
GNU diff .24s
-- ghc-hashed
plain-diff .18s
GNU diff .11s
-- darcs mainline
plain-diff .06s
GNU diff .03s
I think these times are entirely reasonable. Of course, this is a completely
unscientific and somewhat synthetic benchmark. GNU diff would probably tear us
completely apart on heavily modified trees. I had just a few modifications here
and there, if any at all. Still, this is an important use-case. The IO
efficiency matters here the most, not the diff implementation itself. I'm also
wondering why darcs wh takes such huge amounts of time... The difference
between plain-diff and darcs-diff (about 2-fold) is reading and decompression
of the hashed pristine tree. I hope to reduce that further in the no-diff case
even without mtime checks.
Of course, darcs wh is usually much better off, thanks to mtime-based
optimisation. However, I intend to implement that soon-ish, so we'll see how
the mtime-optimised versions bode against each other.
That should be it for today. I'll report back when I have some new results.
Yours,
Petr.
--
Peter Rockai | me()mornfall!net | prockai()redhat!com
http://blog.mornfall.net | http://web.mornfall.net
"In My Egotistical Opinion, most people's C programs should be
indented six feet downward and covered with dirt."
-- Blair P. Houghton on the subject of C program indentation
More information about the darcs-users
mailing list