[darcs-users] Binary patches

Sat Dec 6 14:21:06 UTC 2003

On Fri, Dec 05, 2003 at 08:34:33AM -0800, Kevin Smith wrote:
> Sean E. Russell wrote:
> >Considering that darcs doesn't allow "sub projects", having robust support 
> >for binaries is doubly important for support of dependancies.  
> 
> I agree with this, although my own needs for binary files are not as 
> great as what you described. In my main project at work, we probably 
> have several megabytes of binary files, but they rarely change.
> 
> >The reason why I'm harping on this now is because this is an architectural 
> >issue -- it is something that, if in the future you decide to change, will 
> >cause backward compatability issues in the repository.  IE, it is easier 
> >to change it earlier than later.
> 
> I'm not sure I agree with this, entirely. Adding a new patch type later 
> for binary diffs would be easy. The only downside is that older versions 
> of darcs will not be able to handle those patches. I really don't know 
> how important backward compatibility is.

I definitely plan to add new patch types, but backward compatibility is
important.  In general, the goal is that each new stable release will only
support *writing* patches that can be *read* by the previous stable
release.  Thus any two successive stable releases could be used together
easily.

Probably this will be accomplished by issuing point releases to stable as
new patch types are added (adding read support), so that when darcs 1.2.0
is released, darcs 1.0.x will already have support for reading all the new
patch types.  In the interrim, my plan is that 1.1.x (the coming unstable
branch) by default, won't create any new patch types unless you provide
some sort of --backwards-incompatible flag, so that 1.1.x can be used
together with 1.0.x unless you explicitly choose to test out the new patch
types.

So if binary diff patches aren't in 1.0, support for creating them won't be
added (except for testing) until 1.2.

On Sat, Dec 06, 2003 at 12:05:24AM -0500, Sean E. Russell wrote:
> You're probably right.  I was thinking that having to support two types of 
> patches would be troublesome, but it may not be.

No, it wouldn't be very troublesome to have two patch types.  On the other
hand, it would be 

> David, is your reluctance due to the effort in implementing a decent binary 
> diff algorithm, or in some practical application of the algorithm -- such as 
> time and space requirements?

Just to the effort of implementing it.  Not that it would be *that* much
effort, but it's also not very interesting to me.  I like patches that
commute, and binary patches don't (which makes them much easier, but far
more boring).

This would be a bit of code an interested non-haskell-coder could
contribute.  All that would be needed would be three functions:

/* bindiff puts diff of old and new in diffbuf (which has a size of oldlen
   + newlen + 8 and returns the number of bytes used in the diff. */

int bindiff(char *old, int oldlen, char *new, int newlen, char *diffbuf);

/* apply_bindiff -- newbuf has length oldlen + difflen
   returns the size of new version, negative value on error. */

int apply_bindiff(char *diff, int difflen,
                  char *old, int oldlen, char *newbuf);

/* invert_bindiff -- invdiffbuf is size difflen */

void invert_bindiff(char *diff, int difflen, char *invdiffbuf);

The only tediousness is that most existing binary diff methods don't create
invertible diffs.  And the diffing code should be small enough to be
actually included in darcs, rather than calling an external library.  I'd
rather not trust that external libraries will never change the meanings of
their patches.  Also, I'd like to be able to see what the diff is doing.

As far as time and space complexity goes, anything based on a simple LCS
algorithm on a byte-by-byte basis would be unacceptable, but there are
nice algorithms out there for breaking a binary file into chunks--this is
the hard part of any binary diff algorithm.  Actually, if I had a function:

int find_next_breakpoint(char *data, int datalen);

the rest could easily be done in haskell (and I'd be happy to do it).  But
the find_breakpoint function would have to use one of the aforementioned
clever algorithms, so that it doesn't either break the file into too many
small chunks (as might happen, for example, if you broke on the null
character, or any other given character) or fail to break similar files
such that they have common chunks (which would eliminate space savings due
to using binary diffs).
-- 
David Roundy
http://www.abridgegame.org