[darcs-devel] Compressing Patches, LZO, and that .gz.

Sun Apr 17 06:50:41 PDT 2005

On Sat, Apr 16, 2005 at 10:40:02PM +0100, Ian Lynagh wrote:
> On Sat, Apr 16, 2005 at 07:47:04AM -0400, David Roundy wrote:
> > 
> > I agree that leaving the .gz out would have been a good idea, but I don't
> > think that changing now would be a good idea, since for backwards
> > compatibility we'd have to try both filenames, and that would be a pain.  I
> > think this is a transition we can make when we switch to a hashed
> > inventory, and storing patches according to a hash of the patch contents,
> > since at that time we'll have to deal with a repository format transition
> > anyways.
> 
> Do you mean the patch with hash abcdefg gets stored in a/b/abcdefg, or
> something cleverer?
>
> I think we need to do at least this at some point soon; the Linux repo
> has 27k patches, which I think is approaching the point at which ext2
> starts getting unhappy. I think once we've decided how deep a hierarchy
> we want, all we need to do for 1.0.3 is have darcs look there as well as
> at the root for the file.

No, I hadn't even thought about that, although it's also a good idea...

If we keep the date in the hash (which I like, as it means the probability
of hash collisions won't increase with time), then we could perhaps do
something like
20050202/124633-891bb-0fe04d88648b2bb57542e030a49850b98ce6bc96 which would
lend itself to ready searching of patches, and should be safe as long as
we're under a project taking 30,000 days, or eighty years.  And eighty
years from now, I'd hope filesystems will be all right with more than
30,000 subdirectories.  The other constraint is that projects which
create more than 30,000 patches per day may run into trouble... :)
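
As a rough sketch of that layout (in Python purely for illustration; darcs
itself is written in Haskell and this is not its code), the split could
look like the following.  The function name is hypothetical, and the flat
input name is assumed to be the example above with the slash removed,
i.e. a 14-digit timestamp followed by the remaining dash-separated fields.

```python
import posixpath


def dated_patch_path(flat_name: str) -> str:
    """Turn 'YYYYMMDDhhmmss-<rest>' into 'YYYYMMDD/hhmmss-<rest>'.

    Hypothetical helper: the flat naming scheme is an assumption based
    on the example in the message.
    """
    timestamp, rest = flat_name.split("-", 1)
    if len(timestamp) != 14 or not timestamp.isdigit():
        raise ValueError(f"unexpected timestamp in {flat_name!r}")
    day, time_of_day = timestamp[:8], timestamp[8:]
    # One directory per day, so the directory count grows with the
    # age of the project rather than with the number of patches.
    return posixpath.join(day, f"{time_of_day}-{rest}")


flat = "20050202124633-891bb-0fe04d88648b2bb57542e030a49850b98ce6bc96"
print(dated_patch_path(flat))
# prints 20050202/124633-891bb-0fe04d88648b2bb57542e030a49850b98ce6bc96
```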

What I meant is that I'd like the patch filename to be determined by the
content of the patch itself, rather than the current case where it's
determined only by the PatchInfo.  To make this change, we'll obviously
need to store the hashed filename (or at least the hash of the contents)
in the inventory, or we won't know how to access the file.
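
A minimal sketch of the idea, in Python for illustration only: the
filename comes from hashing the patch contents, and the inventory records
that name next to the patch's description.  SHA-1 is an assumption here
(the message does not name a hash function), and the function names and
the list-of-tuples inventory are hypothetical stand-ins, not the darcs
format.

```python
import hashlib


def content_filename(patch_bytes: bytes) -> str:
    """Name a patch after a hash of its contents (SHA-1 assumed)."""
    return hashlib.sha1(patch_bytes).hexdigest()


def add_to_inventory(inventory: list, patch_info: str, patch_bytes: bytes) -> str:
    """Record (patch_info, content hash) so the file can be found later."""
    name = content_filename(patch_bytes)
    inventory.append((patch_info, name))
    return name


inventory = []
original = add_to_inventory(inventory, "fix typo", b"hunk ./README 1\n-teh\n+the\n")
amended = content_filename(b"hunk ./README 1\n-teh\n+The\n")

# Same description, different contents: the filename changes with the contents.
print(original != amended)  # True
```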

One advantage of this scheme is that we will be able to unpull safely and
transparently in publicly available repositories, since the modified
patches will have new filenames.  Of course, it still won't be a *good*
idea, but the problems will only be sociological--how to tell everyone to
unpull that patch--rather than technical.

The other advantage (which is what led us to this discussion) is that one
can now sign the inventory, which will verify that the contents of the
repository haven't been tampered with--as long as we check that the actual
hashes of the files match the hashes recorded in the inventory.
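
The hash check itself could be sketched as below (Python, illustration
only).  Verifying the signature on the inventory is a separate step and
is not shown.  SHA-1, the function name, and the in-memory dictionary
standing in for the patch files are all assumptions made for the example.

```python
import hashlib


def tampered_patches(inventory, read_patch):
    """Return recorded names whose file contents hash to something else.

    inventory  : iterable of (patch_info, recorded_hash) pairs
    read_patch : callable mapping a recorded hash to the stored bytes
    """
    bad = []
    for _patch_info, recorded in inventory:
        actual = hashlib.sha1(read_patch(recorded)).hexdigest()
        if actual != recorded:
            bad.append(recorded)
    return bad


contents = b"hunk ./README 1\n-teh\n+the\n"
name = hashlib.sha1(contents).hexdigest()
inventory = [("fix typo", name)]

store = {name: contents}
print(tampered_patches(inventory, store.__getitem__))  # []

store[name] = contents + b"+something sneaky\n"
print(tampered_patches(inventory, store.__getitem__) == [name])  # True
```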

Of course, we also need to store inventories/ by hash, and that has the
advantage of making it easier for optimize to reorder the inventory in a
safe manner on a live repository.

> Then in 1.0.4 (or later) we can create the file there instead and
> provide a migration script. Finally, in 1.0.5 (or later) we can stop
> looking in the root. We could also make it a repo option for a bit
> whether we use the hierarchy if backwards compatibility is important.

Indeed, I think we'll have to create both hierarchies for quite a long
transition period.
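
For the reading side of such a transition, a sketch (Python, illustration
only; the names and the two-level depth are assumptions, the depth being
taken from the a/b/abcdefg example) would try the hierarchy first and
fall back to the root:

```python
import os
import tempfile


def nested_path(patches_dir, name, depth=2):
    """Path of `name` inside the hierarchy, e.g. a/b/abcdefg for depth 2."""
    return os.path.join(patches_dir, *name[:depth], name)


def find_patch(patches_dir, name, depth=2):
    """Look in the hierarchy first, then in the root; None if absent."""
    for candidate in (nested_path(patches_dir, name, depth),
                      os.path.join(patches_dir, name)):
        if os.path.isfile(candidate):
            return candidate
    return None


with tempfile.TemporaryDirectory() as repo:
    # A patch written by an older darcs still sits in the root.
    open(os.path.join(repo, "abcdefg"), "w").close()
    print(find_patch(repo, "abcdefg") == os.path.join(repo, "abcdefg"))  # True
    print(find_patch(repo, "zzzzzzz"))  # None
```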

On the plus side, Linus has decided for the moment to start with a fresh
repository, so we won't have to handle 26k patches for day-to-day darcs
development.  I.e., we can procrastinate.
-- 
David Roundy
http://www.darcs.net