[darcs-users] hashed repos qualitatively for GHC

Petr Rockai me at mornfall.net
Tue Jan 12 00:44:52 UTC 2010


Hi,

Simon Marlow <marlowsd at gmail.com> writes:
> My general perception is that many things are quicker, but operations I would
> expect to be instant (and were very quick with darcs 1), like pulling/unpulling
> 1 patch in a GHC repo, are taking several seconds with hashed repos and darcs
> 2. I've tried optimize --reorder, to no avail. This is resulting in an overall
> impression of sluggishness, which is unfortunate.

this sounds like an inventory writing bug to me. I haven't investigated
this in that much detail yet. Might be related to record always writing
a new inventory, which is also quite bad.

Looking at write_either_inventory, I don't see how this could get a
Nothing for a non-empty PatchSet. This eg. means that every time a
tentative inventory is written, a cascade of redundant inventory writes
happens. Which is about every time inventory is manipulated, including
all pushes, pulls, obliterates and so on. Which probably explains your
perceived sluggishness.

Other than being hugely inefficient, this is probably
correct... Interestingly though, write_either_inventory is never needed
for tentativelyAddPatch, which is what we use for record/amend/etc. What
happens is that a tentative inventory is constructed by appending a new
patchinfo, and it is then also written out through
write_either_inventory as a *hashed* file under inventories/ -- that is
*never referenced* -- garbage accumulates.

On the other hand, obliterate may trigger a sequence of inventory writes
rightfully -- if you obliterate a patch from below a tag, eg. However,
it should never need to write, or even examine, inventories past the
last "optimised" tag. Nevertheless, this happens due to the way
write_either_inventory (write_inventory_private) is designed: it
constructs the inventory from the oldest patches, computing hashes as it
goes, so it can use those hashes as the "start" hash of the following
inventory. This traverses the whole PatchSet representing the inventory,
triggering reads of all the "live" inventory files in the repository,
and then writes them out again.

It is not clear this situation can be improved without reworking
PatchSet. The "obvious" approach would be to go about it the same way as
hashed-storage goes about Tree... (in a way, the inventories are a
"hashed list" or hash-chained list...)

My data structure would be something like:

data PatchSet p = Empty | Fragment Hash (RL PatchInfoAnd p)

and manipulating this structure would clear all Hash values (setting
them to NoHash) from the changed fragment up to the most recent fragment
(since all of these would need to be written out... the principle is the
same as with hashed Tree, but this is just a linear list).

Since the current PatchSet has no way of knowing that it corresponds to
existing inventories, the PatchSet itself does not have enough data to
avoid unnecessary inventory writes. It might be possible to work around
this by passing the data along using some other mechanism, but I am not
sure this is worthwhile.

David has done some PatchSet reworking (renaming it to NewSet I think)
on his branch. Not sure this is relevant to our problem though.

Actually, looking some more at the code, it should be possible to avoid
the expensive write_either_inventory at least in the append case (push,
pull) and maybe also mitigate it for the common case of removing from
the most recent inventory. I am not really sure though, the code is
pretty messy and a cleaner rework on top of a changed PatchSet would
probably benefit the code more than trying to work around its
brokenness.

Yours,
    Petr.


More information about the darcs-users mailing list