[darcs-devel] Help with alternate current formats: shall we slurp?

Tue Jan 25 03:48:37 PST 2005

On Tue, Jan 25, 2005 at 11:13:17AM +0100, Juliusz Chroboczek wrote:
> The problem is that I don't know whether you're slurping so many times
> because it was simpler that way (in which case I'll simply eliminate
> all the slurping that's going on), or because you want to discard
> touched Slurpies as often as possible in order to free memory.  In
> either case I know how to proceed, but the ``slurp if it's cheap''
> version will involve going through the Maybe monad quite a lot.

The basic issue is that I'd rather not hold the entire source code tree in
memory ever, so long as this is possible.  There are times (e.g. an initial
record, or creating/using a checkpoint) when one can't avoid holding the
entire source tree in memory, but when it's avoidable, I'd rather not do
it.  And I hope that eventually, via tricks like lazy parsing, we can pare
down the number of times that the entire contents need be held in memory.

Slurping itself doesn't read any of the file contents, but slurp_write
*will* read the file contents.  If the slurpy is discarded after the
slurp_write, then the file contents can be forgotten as soon as they are
read.  So in general, whenever I do a slurp_write, I want to discard the
slurpy so it can be disposed of bit by bit as it is used.

> So could you please tell me why you're slurping so often?  For
> example, why do you think that slurping again in Get.lhs:139 will save
> memory?

In the case of Get, we just want to copy current twice, once to
_darcs/current, and once to the working directory.  We do this in two
slurps, so that we won't have to hold the entire tree in memory.

One more thought: I just realized that your change to reinstate the second
slurp in get when we aren't using current.none doesn't actually help in
that case.  We need to do the second slurp (for s') before doing the
slurp_write s, or the original slurpy will need to be held anyway.
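
To make the ordering concrete, here is a rough sketch.  It is not the code
in Get.lhs: getCopies and traceGet are hypothetical names, and the slurp and
write actions are passed in as parameters so that the sketch only records
the order in which they run.  Whether this ordering saves memory depends on
the real slurp being lazy, which the sketch does not model.

```haskell
import Data.IORef (newIORef, modifyIORef, readIORef)

-- Hypothetical sketch: perform both slurps first, then write each slurpy
-- exactly once, so that neither one has to be kept around for a later use.
getCopies :: IO s -> (FilePath -> s -> IO ()) -> IO ()
getCopies slurp slurpWrite = do
  s  <- slurp
  s' <- slurp                      -- second slurp happens before any write
  slurpWrite "_darcs/current" s    -- s is not referenced after this point
  slurpWrite "." s'                -- neither is s'

-- Record the order of operations, using dummy slurp and write actions.
traceGet :: IO [String]
traceGet = do
  logRef <- newIORef []
  let note msg = modifyIORef logRef (msg :)
  getCopies (note "slurp") (\dir _ -> note ("write " ++ dir))
  reverse <$> readIORef logRef
```

Running traceGet shows both slurps happening before either write.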

> Taking Pull.lhs, you slurp current once in slurp_recorded on line 140,
> once in slurp_recorded_and_unrecorded at line 154, once in
> write_pending at line 194, and once in sync_repo at line 196.  (I'm
> missing one, I don't remember where it was.)  Which of these slurps
> are necessary, which can safely be removed?

The first slurp actually isn't used, but is there in case we ever add
support for displaying context when prompting for patches.  Making this use
unsafeInterleaveIO would be one way to save time.  The repository is
locked, so unless someone is messing with the repository (in which case
it's their fault you've got corruption), this will actually be safe.  We
could stick the unsafeInterleaveIO right in Pull, or we could have a
delayedSlurpCurrent which does the interleaving itself.
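
Something like the following is what I have in mind for delayedSlurpCurrent.
This is only a sketch: the Slurpy type and slurpCurrent here are simplified
stand-ins, not the real darcs definitions, and the counter exists only so
that one can see when the slurp actually runs.  unsafeInterleaveIO itself is
the real function from System.IO.Unsafe.

```haskell
import Data.IORef (IORef, newIORef, modifyIORef, readIORef)
import System.IO.Unsafe (unsafeInterleaveIO)

-- Hypothetical stand-in for darcs's Slurpy: just a list of file names.
data Slurpy = Slurpy [FilePath]

-- Hypothetical stand-in for slurping _darcs/current.  It bumps a counter
-- so that we can observe when the slurp really happens.
slurpCurrent :: IORef Int -> IO Slurpy
slurpCurrent counter = do
  modifyIORef counter (+ 1)
  return (Slurpy ["Pull.lhs", "Get.lhs"])

-- Defer the slurp until the resulting Slurpy is first demanded.
delayedSlurpCurrent :: IORef Int -> IO Slurpy
delayedSlurpCurrent = unsafeInterleaveIO . slurpCurrent

-- Number of slurps performed before and after the Slurpy is forced.
demo :: IO (Int, Int)
demo = do
  counter <- newIORef 0
  s <- delayedSlurpCurrent counter
  before <- readIORef counter
  case s of
    Slurpy files -> length files `seq` return ()
  after <- readIORef counter
  return (before, after)
```

If the slurpy is never looked at (as is currently the case in Pull), the
slurp never runs at all, which is where the time saving would come from.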

The slurp_recorded_and_unrecorded is of course necessary, since it's how we
get the contents we're going to apply to.  Actually, the "recorded" part of
it isn't necessary if we're using current.none, but it doesn't really cost us
anything, and gains us in verifying that the patch actually can be
applied.  And it's necessary in order to slurp the working directory, since
otherwise we wouldn't know which files are in the repository and which are
"junk" files.

The one in write_pending could be eliminated in the common case where there
are no pending changes by first checking whether (is_null_patch $
sift_for_pending p), and if that is the case we don't need to slurp.  Do
make sure that sift_for_pending only gets called once, as it is
potentially expensive.  Note that this common case (empty pending) should
almost always be the case in "non-working" repos.  The only exception would
be if there are conflicts that got marked, which you probably wouldn't want
to happen in a non-working repo anyways.
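
Roughly, the shape I'm suggesting for write_pending is the following.  The
Patch type, isNullPatch and siftForPending below are simplified stand-ins
for the real is_null_patch and sift_for_pending (the filter is purely
illustrative), and the slurp and write actions are parameters so the sketch
stands alone.  The point is just that the sifted patch is bound once and
that the slurp sits inside the test.

```haskell
import Control.Monad (unless)
import Data.IORef (newIORef, modifyIORef, readIORef)

-- Hypothetical stand-in for a darcs patch: a list of change descriptions.
newtype Patch = Patch [String]

isNullPatch :: Patch -> Bool
isNullPatch (Patch changes) = null changes

-- Illustrative stand-in for sift_for_pending: drops changes whose
-- description starts with "hunk".  The real function is more involved.
siftForPending :: Patch -> Patch
siftForPending (Patch changes) = Patch (filter notHunk changes)
  where notHunk c = take 4 c /= "hunk"

-- Sift once, and slurp only when something is left to write.
writePending :: IO slurpy -> (slurpy -> Patch -> IO ()) -> Patch -> IO ()
writePending slurp write p = do
  let pend = siftForPending p    -- bound once, shared by the test and the write
  unless (isNullPatch pend) $ do
    cur <- slurp
    write cur pend

-- Count how many slurps a given patch causes, using dummy actions.
countSlurps :: Patch -> IO Int
countSlurps p = do
  counter <- newIORef 0
  writePending (modifyIORef counter (+ 1)) (\_ _ -> return ()) p
  readIORef counter
```

With an empty sifted patch, countSlurps reports no slurps at all.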

The final one in sync_repo is the easiest.  When running with current.none,
sync_repo should be a null operation, since its entire point is to
synchronize the file modification times on files in _darcs/current with
those in the working directory, provided the files are identical.
I.e. it's just an optimization, and it's one that's irrelevant to the
situation where you've got current.none.  On the other hand, it might be
the place to compute hashes when you get around to HashedCurrent, since
sync_repo does get called whenever the repository contents may have been
modified.
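
In code, the current.none case might look something like this.  The
CurrentFormat type and its constructors are made up for illustration (I'm
not assuming how your format selection is actually represented), and the
real synchronization work is passed in as an action.

```haskell
import Data.IORef (newIORef, modifyIORef, readIORef)

-- Hypothetical description of how the repository stores "current".
data CurrentFormat = PlainCurrent | NoCurrent

-- With no _darcs/current there are no modification times to synchronize,
-- so the supplied action is skipped entirely.
syncRepo :: CurrentFormat -> IO () -> IO ()
syncRepo NoCurrent    _         = return ()
syncRepo PlainCurrent syncTimes = syncTimes

-- Count how often the synchronization action runs for a given format.
countSyncs :: CurrentFormat -> IO Int
countSyncs format = do
  counter <- newIORef 0
  syncRepo format (modifyIORef counter (+ 1))
  readIORef counter
```

A HashedCurrent case could later be added as a third constructor with its
own branch that computes the hashes.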
-- 
David Roundy
http://www.darcs.net