[darcs-users] Memory usage of Record

Jason Dagit dagit at codersbase.com
Thu Sep 3 15:44:12 UTC 2009


On Mon, Aug 31, 2009 at 2:59 PM, Max Battcher <me at worldmaker.net> wrote:
>
> It sounds a lot like Petr's thoughts on chunky hunks to me... The idea that
> once a chunk gets above a certain threshold, the chunk's data can easily be
> dumped into hashed-storage as a separate unit and the patch itself then
> merely contains a hash reference to that chunk data.
>
> In this particular case it simply becomes a matter of chunking that data
> into hashed-storage as soon as possible during record to avoid huge memory
> usage (keeping only the hash references in memory). (The only real
> complication is making sure good garbage collection policies are in
> place should the record be canceled.)

Having thought more about the problem and the merits of some of the
potential solutions, I have an update.

I reread the documentation for Iteratees.  It seems that they excel
precisely at processing lots of data with one of several goals:
a) extracting a few values (efficient random access is even
supported), b) reducing the data bit by bit to some smaller value
(say, a checksum), or c) doing a 'map' style transformation on the
data stream.
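
To make the idea concrete, here's a minimal, self-contained sketch of
the iteratee shape -- NOT the real library's API, just the shape of
the thing:

    -- A stream consumer is either done with a result, or a
    -- continuation waiting for the next chunk of input.  By
    -- convention here, an empty chunk signals end of input.
    data Iteratee e a = Done a
                      | Cont ([e] -> Iteratee e a)

    -- Feed chunks to an iteratee until it finishes.
    run :: Iteratee e a -> [[e]] -> Maybe a
    run (Done x) _      = Just x
    run (Cont _) []     = Nothing            -- consumer still hungry
    run (Cont k) (c:cs) = run (k c) cs

    -- Goal (b): reduce the stream chunk by chunk to one small value
    -- (a running sum standing in for a checksum).  Only the
    -- accumulator stays in memory, never the whole stream.
    sumI :: Int -> Iteratee Int Int
    sumI acc = Cont step
      where step [] = Done acc               -- end of input
            step xs = sumI (acc + sum xs)

For example, run (sumI 0) [[1,2],[3],[]] yields Just 6 without ever
holding more than one chunk.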

Unfortunately for anyone wanting to apply Iteratees like a hammer to a
nail, this isn't going to gain us much with record as it stands.  The
reason lies in what darcs wants to do with patches once it reads them.
Iteratees could definitely be used to make efficient patch parsers
that don't hold excessive amounts of the patch in memory, as long as
we're doing (a), (b), or (c).  The problem is that darcs wants to hold
the whole sequence in memory.  I guess this is where Petr's chunky
hunks[1] shine.  The patch summary is typically small and we could
hold many in memory, but the details can be rather large, and holding
them in memory is sometimes equivalent to holding the whole directory
tree of the repository in memory.  Chunky hunks are a win because
they decouple the summary from the details.  Of course, by "patch
summary" I mean enough info about the patch to commute it, at least
in most cases, with other patches.
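
To make that concrete, here is roughly the in-memory shape I have in
mind (the field names are mine, not necessarily Petr's design):

    -- A content address into hashed-storage.
    newtype Hash = Hash String

    -- Hypothetical chunky hunk: everything commute needs (the
    -- "summary") stays inline; the potentially huge old/new text
    -- (the "details") is only referenced by hash.
    data ChunkyHunk = ChunkyHunk
      { chFile   :: FilePath   -- which file the hunk touches
      , chLine   :: Int        -- first line of the change
      , chOldLen :: Int        -- number of lines removed
      , chNewLen :: Int        -- number of lines added
      , chOld    :: Hash       -- removed text, by reference
      , chNew    :: Hash       -- added text, by reference
      }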

Naturally, I started to wonder about ways of keeping our existing
patch format but internally separating the patch summary from the
details.  Continuing to use record for inspiration, can we come up
with other ways of decoupling the two?  For example, we could keep the
existing patch format on disk but use the chunky hunk representation
in memory: during record we read in the pending, say using Iteratees,
and instead of holding the full patches in memory we hold just a
chunky hunk representation of them.  Actually, I think the current
patch parsing may be just fine for this; I just need to check how much
memory it would take to read all the patches while only holding on to
the chunky hunk level of detail.
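
A sketch of that conversion step, building on the ChunkyHunk type
above (the storage location and the hashing here are stand-ins, not
the real hashed-storage API):

    import qualified Data.ByteString as B
    import System.Directory (createDirectoryIfMissing)

    -- Roughly what the parser hands us today: file, line, old lines,
    -- new lines, with the hunk bodies as actual bytes in memory.
    data FullHunk = FullHunk FilePath Int [B.ByteString] [B.ByteString]

    -- Toy content address; real code would use hashed-storage's own
    -- hashing, not this placeholder.
    toyHash :: B.ByteString -> Hash
    toyHash body = Hash (show (B.foldl' step 0 body :: Integer))
      where step acc w = (acc * 33 + fromIntegral w) `mod` 1000000007

    -- Store a hunk body in content-addressed storage; the path is a
    -- made-up location for the sketch.
    writeChunk :: B.ByteString -> IO Hash
    writeChunk body = do
      createDirectoryIfMissing True "_darcs/chunks"
      let h@(Hash name) = toyHash body
      B.writeFile ("_darcs/chunks/" ++ name) body
      return h

    -- During record: ship each hunk body out to storage as soon as
    -- it is parsed, keeping only the summary plus two hashes.
    chunkify :: FullHunk -> IO ChunkyHunk
    chunkify (FullHunk f l old new) = do
      ho <- writeChunk (B.concat old)   -- line boundaries elided here
      hn <- writeChunk (B.concat new)
      return (ChunkyHunk f l (length old) (length new) ho hn)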

A change like this requires at least the following:
1) Rewiring commute to operate on chunky hunks (sketched after this list)
2) Making the chunky hunk representations
3) Rewiring the UI to select changes on chunky hunks and display patch
contents given a chunky hunk
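
To give a feel for #1: a simplified commute on the chunky
representation, ignoring the boundary cases darcs actually has to
handle.  The point is that it only ever reads and adjusts the summary
fields; the hashed bodies are never touched.

    -- Commute A;B into B';A' when the hunks don't overlap.  B's line
    -- number is relative to the state after A has applied, so one of
    -- the two gets shifted by the other's net change in line count.
    commuteHunks :: ChunkyHunk -> ChunkyHunk
                 -> Maybe (ChunkyHunk, ChunkyHunk)
    commuteHunks a b
      | chFile a /= chFile b = Just (b, a)   -- different files: trivial
      | chLine b + chOldLen b < chLine a     -- B entirely above A
          = Just (b, a { chLine = chLine a + shift b })
      | chLine a + chNewLen a < chLine b     -- B entirely below A's output
          = Just (b { chLine = chLine b - shift a }, a)
      | otherwise = Nothing                  -- overlapping: commute fails
      where shift h = chNewLen h - chOldLen h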

Other things that may be required:
4) Changing the patch parsing to work in a bounded-memory, streaming fashion
5) Storing the pending chunky hunks on disk while the user is
interacting with darcs
6) When showing hunks to the user, only reading as many bytes at a
time as we're able to dump to the output (see the sketch below)
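
For #6, something along these lines would do: stream a stored chunk
to the output a fixed-size block at a time, so showing a hunk costs
bounded memory no matter how big the hunk is.

    import System.IO
    import qualified Data.ByteString as B

    -- Copy a stored chunk to a handle 32k at a time; memory use is
    -- bounded by the block size, not by the size of the hunk.
    streamChunk :: Handle -> FilePath -> IO ()
    streamChunk out path = withFile path ReadMode loop
      where
        loop h = do
          block <- B.hGet h 32768
          if B.null block
            then return ()                 -- EOF
            else B.hPut out block >> loop h

Used as, say, streamChunk stdout ("_darcs/chunks/" ++ name) when the
UI wants to display a hunk body.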

Am I missing anything big?

#5 requires cleanup, and it could be viewed as an alternative
solution to the whole problem I'm approaching, I think.  The biggest
drawback I can see to a pure #5-based approach is that, as near as I
can tell, commute would still take O(m + n) memory given patches of
size m and n.  With the layered approach, I think we can reduce the
memory required to commute to the size of the chunky hunks.  Replace
patches are a different case, I think, but there we could stream the
hunk data and never hold more than some reasonable amount in memory.
In other words, I could probably fix the performance of record to be
O(m), where m is the size of a patch, with just #5, but as soon as I
did that I would turn to improving things further and be right back
here.

Eventually, we'll probably want to avoid storing even the whole chunky
hunk sequence in memory.  I suspect that change will be less intrusive,
and that it's reasonable to postpone it as a 'phase 2'.

How does this proposal sound to people?

If I were starting work on this tomorrow to address the record case,
I wouldn't have to worry about #1 for a while.  The focus would be on
#2 and #3, and on checking whether #4 and #6 are needed.  I guess I
would work in a private repository until I had something testable,
although I would be happy to publish it for those who want to follow
along.

Also on my list of things I want to work on:
* Finishing type witnesses in the commands
* Implementing RIO witnesses
* Creating a written specification of the file system operations darcs
does during various commands

I feel like, for the first two on that list, I'm waiting for Petr's
GSoC work to be merged.  I'm not eager to finish the witness stuff
just to have to redo a large portion of it or conflict with work Petr
has already done.

Would this record work be orthogonal to Petr's GSoC work, or am I
going to need to wait on that too?  (At some point, I'll be waiting on
too much stuff and I'll get annoyed by that and just start working on
one or more of the items.)

Feedback appreciated!

Thanks,
Jason

[1] http://web.mornfall.net/blog/patch_formats.html

