[darcs-users] Possibly a very simplistic solution

David Roundy droundy at abridgegame.org
Thu May 20 11:13:15 UTC 2004

On Wed, May 19, 2004 at 02:33:28PM +0200, Ketil Malde wrote:
> David Roundy <droundy at abridgegame.org> writes:
> > we'd be able to avoid system calls to a large extent.  On big repos, the
> > bottleneck is often stat(2) calls to find modification times and file
> > sizes.
> I tried on a very small repo, just three files and one subdir.  A
> 'whatsnew' (with a couple of changes) does:
> 56 stat64
> 38 fstat64
> 29 mmap
> ~24 each of read, open, write, close
> I don't know how this scales, but couldn't the results be cached in
> some of these cases?
> Each source file seems to be stat64'ed four times, and its copy in
> "current" six times, for instance.

Hmmmm.  This is about twice as many stats as I'd expect.  I guess this is
because some of the Haskell standard library routines are calling stat64.
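
To illustrate the caching idea from the quoted message, here is a minimal
sketch in Python (not darcs code, which is Haskell).  The class name
StatCache and its methods are hypothetical, and it assumes that files do not
change behind our back during a single command:

```python
import os


class StatCache:
    """Remember os.stat results per path for the lifetime of one command."""

    def __init__(self):
        self._cache = {}
        self.calls = 0  # number of real stat system calls issued

    def stat(self, path):
        key = os.path.abspath(path)
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = os.stat(key)
        return self._cache[key]

    def invalidate(self, path):
        # Must be called after we write to a file or change its mtime.
        self._cache.pop(os.path.abspath(path), None)
```

With something like this, asking four times for the same file costs one
stat call instead of four, at the price of having to invalidate entries
whenever the command itself modifies a file.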

The number of stats could be cut by about a factor of two, at a cost.
Before doing the diff, darcs first goes through every potentially modified
file and checks whether it was really modified.  If not, it sets the
modification time of the copy in "current" to match that of the file in the
working directory (this function is called "sync" because it synchronizes
the mtimes of identical files).  This means we stat every file twice--once
when syncing and once when diffing.  It also means every *modified* file is
read twice--again a waste, but in this case it scales with the number of
modified files.
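
A rough sketch of what such a sync pass does, written in Python rather than
taken from the darcs source.  The function name sync_mtimes, its arguments,
and the flat list of file names are assumptions made for the illustration:

```python
import filecmp
import os


def sync_mtimes(working_dir, current_dir, names):
    """Give identical files the same mtime; return the names that were synced."""
    synced = []
    for name in names:
        work = os.path.join(working_dir, name)
        cur = os.path.join(current_dir, name)
        w, c = os.stat(work), os.stat(cur)
        if w.st_mtime_ns == c.st_mtime_ns and w.st_size == c.st_size:
            continue  # looks unmodified already, no need to read anything
        # Potentially modified: compare the contents byte by byte.
        if w.st_size == c.st_size and filecmp.cmp(work, cur, shallow=False):
            # Same contents, so copy the working mtime onto the pristine copy.
            os.utime(cur, ns=(c.st_atime_ns, w.st_mtime_ns))
            synced.append(name)
    return synced
```

A later diff pass that trusts size and mtime can then skip the synced files
without reading them, but it still has to stat both copies again, which is
where the second round of stat calls comes from.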

This could be avoided by creating a simultaneous diff-and-sync function,
but that would be a bit nasty, since diff itself is an ugly function.
Also, it would eliminate the laziness in diff, which would be unfortunate.

Another way around it would be to not sync every time--we could randomly
decide whether or not to sync, which would speed things up most of the time
by a factor of two on large repos.  The catch would be that if you record a
very large change, you may find afterwards that whatsnew/record are very
slow for a while, since they would keep running diff on all the files that
you touched in that previous record.
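
The random decision itself is tiny.  A minimal sketch in Python, where the
name maybe_sync and the probability parameter are made up for the example
and the sync argument stands for whatever does the real work:

```python
import random


def maybe_sync(sync, probability=0.5, rng=random):
    """Call sync() with the given probability; return True if it ran."""
    if rng.random() < probability:
        sync()
        return True
    return False
```

With probability p, the expected number of commands up to and including the
first one that syncs is 1/p, so touched-but-identical files would be
re-diffed on about 1/p - 1 commands on average before their mtimes are
brought back in line.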

A third possibility would be to run the sync after each record, pull or
apply, rather than before each whatsnew or record.  The advantage here is
that the "fast" whatsnew isn't slowed down, but instead the "slow" pulls
and applies are slowed down.  Also, it eliminates redundant syncs.

David Roundy
