[darcs-devel] Re: darcs patch: Change a can't... (optimization discussion)

Thu Apr 14 05:05:20 PDT 2005

On Thu, Apr 14, 2005 at 02:14:24AM +0200, Benedikt Schmidt wrote:
> Benedikt Schmidt <ry102 at rz.uni-karlsruhe.de> writes:
> 
> > David Roundy <droundy at abridgegame.org> writes:
> >> I'd say we should now focus on making the "fast" cases *not* be
> >> proportional in time to the size of the repository.  In particular, I
> >> have:
> >>
> >> $ time darcs whatsnew                           
> >> No changes!
> >>
> >> real    0m5.526s
> >> user    0m4.330s
> >> sys     0m0.980s
> >>
> >> We're spending an incredible amount of time checking timestamps and
> >> reading directory contents.  Linus' git does this in 0.15s elapsed
> >> time.  Similarly a record of a single small change in a kernel
> >> repository takes 22s.  Git's equivalent of record takes 0.8s.
> >>
> >> In both cases, I think our limiting factor is the directory reading
> >> and statting in the slurp functions.  Some of this may be string
> >> conversions, and some of it is the fact that we do a lot of extra
> >> calls to stat and lstat.  It would be nice to know which of these two
> >> problems is the more serious, since they're both tediously be
> >> straightforwardly fixable, but I'm not sure which would be better to
> >> fix first.  I suppose reducing the number of stats would be great,
> >> since I *know* that's the limiting factor when running whatsnew over
> >> nfs...
> 
> I did some more tests and reducing the stats per file from 4 to 1 doesn't
> help that much:
> 
> darcs.old wh  2.71s user 0.80s system 99% cpu 3.523 total
> darcs.new wh  2.34s user 0.32s system 99% cpu 2.671 total

True, it doesn't seem to be the *dominant* effect, but reducing the time
from 3.5s to 2.7s is far from trivial... it suggests that one third of the
time is being spent in .  And, as I mentioned, on some systems (e.g. nfs
perhaps) it may be a bigger deal.  Have you sent in a patch to do this? I
haven't yet finished scanning my email for patches this morning...

> I think we have to do some profiling before deciding what to do next.  If
> the string conversions, something similar or the remaining IO is the real
> problem,

Indeed, that sounds like a good idea, although I tend to be wary of
using profilers--all too often they seem to distort their results one way
or another.

BTW, one of the reasons I'd like to avoid Strings is to make i18n easier.
Strings (or perhaps I should say FilePaths) have an unfortunate ambiguity,
which is that they pretend to be unicode strings, when they're really just
sequences of bytes.  I'm more comfortable with PackedStrings for file
paths, since they really *are* just sequences of bytes, with no pretending
to be unicode characters.

Ghc could horribly break darcs without deviating from haskell 98 in the
least, by simply supporting some different 8 bit encoding.  This makes me
uncomfortable.  I just don't like darcs depending on undocumented features
of the language, and that's what FilePath is.  If you avoid using the FFI
to access files (which isn't acceptable) and don't want to pass store
FilePaths, or pass them from one computer to another, then haskell IO is
all right.  Otherwise, you depend on the actual implementation of the IO
libraries *not* supporting real unicode filenames.  :(

> If the IO is too slow, we can try to adopt some of tricks git uses like
> sorting by inode number before calling stat on the files or using a
> single file where the names, types and permissions of all files in the
> pristine directory are stored. I think git doesn't do any readdir calls
> and we could get rid of most of the calls too when look-for-adds is not
> specified.

Yeah, I remember that git avoids readdir.  But doing that would require
storing lists of files in each directory, which may be as expensive as a
readdir.  Alhough if we stored file lists, we could also store stat
information in the same location, and that would allow us to avoid all the
calls to stat (in _darcs/current), *and* to store things like inode number
if we like, to reduce the danger of not seeing changes due to accidentally
identical modification times.

Sorting by inode number would be an interesting trick, but if we're going
to do it, it had better be done at a very low level, e.g. by a special call
to readdir.
-- 
David Roundy
http://www.darcs.net