[darcs-users] hashed repository issue

Petr Rockai me at mornfall.net
Tue Dec 9 09:21:12 UTC 2008


Hi,

Dan Pascu <dan at ag-projects.com> writes:
> Not to mention that a directory with 32k files makes things _very_ slow. 
> This flattened pristine directory may cause a considerable slowness of 
> darcs2 compared to darcs1 if the repository contains many files. The same 
> can be said for the flattened patches directory, but that is a problem 
> common to both of them. Still as the number of patches increases, things 
> will get gradually slower.

This is the n-th time this argument has come up, so I figured it was time to
quantify the problem. Here are the results:

10:08:56 | morn at eri:~/dev/rh/lvm2 -> time cp -Rl tailor/_darcs test_tailor
cp -Rl tailor/_darcs test_tailor  0,07s user 23,51s system 95% cpu 24,807 total
10:09:30 | morn at eri:~/dev/rh/lvm2 -> for d in test_tailor/patches test_tailor/pristine.hashed; do (cd $d; time runghc ../../letterify.hs); done 
runghc ../../letterify.hs  1,08s user 0,13s system 87% cpu 1,374 total
runghc ../../letterify.hs  5,60s user 2,37s system 96% cpu 8,286 total
10:10:27 | morn at eri:~/dev/rh/lvm2 -> time cp -Rl test_tailor test2_tailor
cp -Rl test_tailor test2_tailor  0,06s user 1,95s system 99% cpu 2,016 total
10:10:38 | morn at eri:~/dev/rh/lvm2 -> time cp -Rl tailor/_darcs test3_tailor
cp -Rl tailor/_darcs test3_tailor  0,07s user 24,91s system 98% cpu 25,287 total

The tailor repository is a conversion of LVM2's CVS tree by tailor, taken as it
came out, without any manual intervention such as optimising. Note that I have
not used darcs at all, only cp -Rl, which recursively hardlinks the directory.
The letterify.hs script is as follows:

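import Control.Monad (forM_)
import System.Directory (createDirectory, getDirectoryContents, renameFile)
import System.FilePath ((</>))

-- True if the first character after the first '-' in the file name is l,
-- i.e. the first hex character of the hash. Names without a '-' never match.
startsWith :: Char -> String -> Bool
startsWith l s = case dropWhile (/='-') s of
                        '-':x:_ -> x == l
                        _ -> False

-- For each hex digit, create a bucket directory named after it and move
-- every matching file from the current directory into it.
main :: IO ()
main = forM_ (['a'..'f'] ++ ['0'..'9']) $ \l -> do
    createDirectory [l]
    files <- filter (startsWith l) `fmap` getDirectoryContents "."
    mapM_ (\f -> renameFile f $ [l] </> f) files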

You could possibly try this for yourself on your favourite repository.

The reading of these numbers is:

- flat patches + pristine costs about 25 seconds of system time just to `cp
  -Rl` the _darcs dir, presumably 99 % of that time being spent in internal
  directory lookup routines
- just chopping this up into 16 buckets, based on the first hex character of
  the hash, gets us down to some 2 seconds, which is a more than ten-fold
  speedup
- the last line just retries with a slightly hotter cache, although I believe
  it was already pretty hot the first time; I wouldn't expect that to have a
  significant impact anyway
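
The bucketing rule used by letterify.hs can also be written as a pure mapping
from file name to bucketed path. This is only a minimal sketch: the names
bucketOf and bucketedPath are made up for illustration and are not part of
darcs, and it assumes the hash starts right after the first '-' in the name.

```haskell
import System.FilePath ((</>))

-- Hypothetical helper (not darcs code): the bucket for a hashed file name is
-- the first character after the first '-', provided it is a lowercase hex
-- digit. Anything else has no bucket.
bucketOf :: String -> Maybe Char
bucketOf name = case dropWhile (/= '-') name of
    '-' : x : _ | x `elem` hexDigits -> Just x
    _ -> Nothing
  where hexDigits = ['0'..'9'] ++ ['a'..'f']

-- Hypothetical helper: where a file would live in a bucketed layout.
-- Names without a bucket are left where they are.
bucketedPath :: FilePath -> FilePath
bucketedPath name = maybe name (\b -> [b] </> name) (bucketOf name)
```

Keeping the rule pure like this would let the same function be used both for
moving existing files and for looking them up later.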

So who's with me that for 2.3 we should add a "bucketed" keyword to our
_darcs/format and start using it? Moreover, we need to do the same for our
global cache; by now I get the following:

10:19:06 | morn at eri:~ -> l ~/.darcs/cache/pristine.hashed | wc -l
40645
10:19:10 | morn at eri:~ -> l ~/.darcs/cache/patches | wc -l       
86769

Btw., that also disproves the 32k hard limit, as this is ext3 here.
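
To see how a directory like the cache would split up before actually moving
anything, the entries can be tallied per bucket. Again only a sketch:
bucketCounts is a made-up name, and it uses the same "first character after
the first '-'" rule as letterify.hs.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical helper: count how many names would land in each bucket.
-- Names without a '-' followed by at least one character are skipped.
bucketCounts :: [String] -> Map.Map Char Int
bucketCounts names = Map.fromListWith (+)
    [ (x, 1) | n <- names, '-' : x : _ <- [dropWhile (/= '-') n] ]
```

Feeding it the result of getDirectoryContents on a cache directory gives the
per-bucket sizes. If the hashes are evenly distributed (an assumption, not
something measured here), 86769 entries over 16 buckets comes to roughly 5400
per directory, and 40645 entries to roughly 2500.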

Yours,
   Petr.

-- 
Peter Rockai | me()mornfall!net | prockai()redhat!com
 http://blog.mornfall.net | http://web.mornfall.net

"In My Egotistical Opinion, most people's C programs should be
 indented six feet downward and covered with dirt."
     -- Blair P. Houghton on the subject of C program indentation

