[darcs-devel] Proposal for new format to store patch files

Wed Jun 1 15:39:20 PDT 2005

Ian Lynagh <igloo at earth.li> writes:

> On Thu, May 12, 2005 at 12:22:34PM -0700, Jason Dagit wrote:
>> My proposal is to use a format similar to ar (or we could maybe use ar
>> as is).
>> [snip description of ar idea]
>
> When applying patches etc this isn't a problem.

Fortunately, applying patches remains fast for small patches.  I have a
repository where the pristine tree is about 200MB.  Some of the
initial patches are 40-60MB in size.  Getting the repository isn't
really an option at this point.  I believe the last time I tried it,
it took on the order of 6 hours.  I can scp the directory in a fraction
of that time, which is what I end up doing when I need to "get".  It's
an "okay" workaround since I'm the only one who uses this repo, but
it would be a pain if a real project required this.

> If we just want to know what files a patch affects, or what patches
> affect a file, then we should have a separate index for this info.
> (we have to be careful re: renames, of course).

Is anyone working on this?  Is there a bug that I should be watching?
I could help test out the changes.  I'd like to contribute to the
betterment of darcs.
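
To make the index idea concrete, here is a minimal sketch of the
"what patches affect a file" direction, built by inverting a
"what files does a patch affect" table.  It is Python purely for
illustration, the patch names and paths are made up, and it ignores
renames entirely, which a real index would have to track.

```python
from collections import defaultdict


def build_index(patch_files):
    """Invert a patch -> paths table into a path -> patches table.

    patch_files maps a patch name to the list of paths it touches.
    Renames are NOT handled; this is only the naive version.
    """
    files_to_patches = defaultdict(list)
    for patch, paths in patch_files.items():
        for path in paths:
            files_to_patches[path].append(patch)
    return dict(files_to_patches)


# Hypothetical data, not taken from any real repository.
example = {
    "patch-1": ["src/Main.hs", "README"],
    "patch-2": ["src/Main.hs"],
}
index = build_index(example)
```

With a table like this, either question becomes a lookup instead of
a scan over every patch file.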

> It might be worth doing something like what you suggest for when we want
> to see exactly how patch P affects file F.

add, record, push and pull are the only operations which I know for
certain I can do on my huge repository.  I'm hesitant to try anything
else.  For example, I've tried "whatsnew -l" before and run out of
memory, although I think this had more to do with the large number of
files (several gigs) which were not in the repository.  Actually, this
point probably needs attention as well.  If you list each of the files
that were not in the repository, it would be a small amount of data.
I suspect darcs was actually reading each file instead of just noting
that the file was not in the repository.  I should try this again with
a recent version of darcs.  It would appear that 1.0.3rc1 has this
problem as well.  I think I'll look at the code and see if I can make
a patch for this.

I guess I'm trying to say that if a file is not in the repository,
then whatsnew should probably stop looking at the file as soon as it
realizes that the file is new.
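
A rough sketch of the behaviour I'm describing, in Python for
illustration only (this is not how darcs is implemented, and the
"tracked" set is a stand-in for whatever darcs knows about the
pristine tree).  The point is that only directory listings are
consulted, so the size of the untracked files never matters.

```python
import os
import tempfile


def untracked_files(root, tracked):
    """Yield paths under root that are not in the tracked set.

    Only directory listings are used; no file is ever opened.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip the darcs metadata directory itself.
        dirnames[:] = [d for d in dirnames if d != "_darcs"]
        for name in sorted(filenames):
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            if rel not in tracked:
                yield rel


# Small demo on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as root:
    for name in ("tracked.txt", "new.bin"):
        with open(os.path.join(root, name), "w") as f:
            f.write("x")
    found = list(untracked_files(root, {"tracked.txt"}))
```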

> I think we'd just want our own header at the start though, so rather
> than having
>
>     gzip(foo_hunks, bar_hunks)
>
> we would have
>
>     gzip(foo=0\nbar=sizeof(foo_hunks)\n, EOH, foo_hunks, bar_hunks)
>
> (this would also give us one of the above indices).
> (also, we have to be careful that either all the bits for one file are
> together or that we get them all. Probably best to try to do both).

I respect the idea of starting small.  I'm not up to speed on the
darcs source, but I do know some Haskell.  What would the impact of
making this change be?  How far-reaching would it be?  I'm trying to
gauge whether making this change would be a reasonable way to learn
the darcs source tree.
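
To check that I understand the layout quoted above, here is a small
sketch of it in Python (illustration only).  The "EOH" marker bytes,
the function names and the "name=offset" line syntax are my
assumptions about the details; only the overall shape (offset
header, end-of-header marker, then the per-file hunks, all inside
one gzip stream) comes from the proposal.

```python
import gzip

EOH = b"EOH\n"  # assumed end-of-header marker


def pack(hunks):
    """hunks: list of (filename, bytes) pairs, one entry per file."""
    header = []
    offset = 0
    for name, data in hunks:
        # Each header line records where that file's hunks start.
        header.append(f"{name}={offset}\n".encode())
        offset += len(data)
    body = b"".join(data for _, data in hunks)
    return gzip.compress(b"".join(header) + EOH + body)


def read_hunks(blob, wanted):
    """Return the hunks for one file without parsing the others."""
    raw = gzip.decompress(blob)
    header, _, body = raw.partition(EOH)
    entries = []
    for line in header.decode().splitlines():
        name, _, off = line.rpartition("=")
        entries.append((name, int(off)))
    for i, (name, off) in enumerate(entries):
        if name == wanted:
            # A section ends where the next one starts (or at EOF).
            end = entries[i + 1][1] if i + 1 < len(entries) else len(body)
            return body[off:end]
    raise KeyError(wanted)


blob = pack([("foo", b"hunk A\n"), ("bar", b"hunk B\nhunk C\n")])
bar_hunks = read_hunks(blob, "bar")
```

Note that because everything sits inside a single gzip stream, the
data before a given offset still has to be decompressed; what the
header saves is parsing the hunks of files we don't care about.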

> This would have the advantage that you can still open the files in a
> pager/editor.

ar would respect this as well.  The header information in an ar file
is stored as plain text; the contents are binary only if the file
itself was binary.  We could still ASCII-armor if we want.  Actually,
if we don't have binary diffs, I see no reason to process a binary
file unless we need to copy it.  I think the slowdown on large patches
is largely caused by processing the patch character by character at
times when we shouldn't _need_ to.  I realize at times it may be
unavoidable to process the data, but for those times it seems like a
fast "copy section of file to disk" function (in C if we have to)
would be desirable.
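
By "copy section of file to disk" I mean something along these
lines.  This is a Python sketch for illustration, not a proposal for
the actual darcs code; the function name, parameters and chunk size
are all made up.  The data moves in large blocks and is never
inspected byte by byte.

```python
import io


def copy_section(src, dst, offset, length, chunk_size=64 * 1024):
    """Copy `length` bytes starting at `offset` from src to dst.

    src and dst are binary file objects; src must be seekable.
    """
    src.seek(offset)
    remaining = length
    while remaining > 0:
        chunk = src.read(min(chunk_size, remaining))
        if not chunk:
            raise EOFError("source ended before the section was copied")
        dst.write(chunk)
        remaining -= len(chunk)


# Demo on in-memory files; real use would pass files opened in "rb"/"wb".
src = io.BytesIO(b"0123456789")
dst = io.BytesIO()
copy_section(src, dst, offset=3, length=4, chunk_size=3)
copied = dst.getvalue()
```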

>> [request for permission data in the patch format]

> Storing the info isn't the problem here, it's working out what to do
> with it.

Fair enough.  I have been educated by the nice people of #darcs on
why this is problematic.  I have some ideas brewing for how to get
CVS-style root repositories for people who insist on using darcs like
they would use CVS.  But I'm not ready to share those ideas yet.

Thanks,
Jason