[darcs-devel] binary patches and diffing

Sat Sep 3 07:09:32 PDT 2005

On Sat, Sep 03, 2005 at 03:51:33AM +0100, Ian Lynagh wrote:
> On Fri, Sep 02, 2005 at 03:07:51PM -0700, Bill Trost wrote:
> > So, I propose that a binary patch consist of a sequence of the 5-tuple
> > 
> > 	(old-start, old-length, new-start, new-length, bitmask)
> > 
> > Applying a binary patch consists of doing the following with each
> > 5-tuple:
> > 
> > 	1. Truncating or zero-extending the bitmask and source region so
> > 	they are as long as the target region.
> > 
> > 	2. XORing the bitmask and source region to generate the target
> > 	region.
> 
> Sounds good to me. I think commuting with current binary patches is
> possible, albeit slightly nasty.

Sounds good to me too.  But I don't think binary patches should ever
commute... to me that's a lot of what defines a binary patch.  You don't
want to ever merge two changes to a binary file and get a merged
result--the assumption is that anything darcs does other than record
versions may "break" the file.  This makes binary patches simpler, since
they just need to be invertible.

> We could also do
> (old-start, old-length, new-start, new-length, old-content, new-content)
> of course, but once you stop storing the complete file in each patch I
> think having the actual content in the patch has very little advantage
> over the bitmask idea.

This format might be nice for non-binary files, though.  What you describe
is basically a character-based hunk patch type, which would be a nice
feature for certain sorts of text files.

Interestingly, the diff code for generating character-based hunk patches
could be the same as the code for generating the binary patches described
above, except that when generating binary patches you might have different
priorities (fewer larger hunks might be better, and we might want to scale
better to larger files, since we benefit only in space from breaking a
binary patch into smaller hunks).  But there could certainly be a lot of
code reuse between the two.  And we also ought to be able to leverage our
existing diff.

> Does this all fly? Presuming I'm not completely out in left field, it
> seems like integrating the format into darcs promptly would be useful,
> just to ease the backward-compatibility burden. The bleeding-edge
> version of darcs could start generating full-body XORs soon thereafter,
> and people could start experimenting with other diff algorithms at their
> convenience. It might even be useful to add a "--binary-diffs" flag to
> "darcs optimize" someday, especially for those maintaining public darcs
> repositories.

Indeed that sounds like a good plan of attack.  Hopefully our new
"RepoFormat" code will also ease the forward and backward compatibility
issues (and I'm definitely looking forward to our first opportunity to make
use of that framework).

Bill, were you interested in implementing some of this support? i.e. are
you either a haskeller or one who'd be interested/willing to learn haskell?
It's a great language...
-- 
David Roundy
http://www.darcs.net