[darcs-devel] binary patches and diffing

Fri Sep 2 15:07:51 PDT 2005

Hi, all,

I just recently installed darcs, and have been giving some thought
to its handling of binaries (an important topic for me at work,
unfortunately). It slowly dawned on me that there are actually two
separate problems here: Computing a binary diff, which is likely Hard
and, for some, Uninteresting, and binaries patches, while probably still
Uninteresting, can be quite simple.

So, I propose that a binary patch consist of a sequence of the 5-tuple

	(old-start, old-length, new-start, new-length, bitmask)

Applying a binary patch consists of doing the following with each
5-tuple:

	1. Truncating or zero-extending the bitmask and source region so
	they are as long as the target region.

	2. XORing the bitmask and source region to generate the target
	region.

An important thing to note here: This binary diff format is its own
inverse, so the format gives us invertible patches, so long as the
entirety of the new and old versions of the file are covered. The
algorithm is also very simple, so it would be safe to add it to stable
branches of darcs.

We can use this patch format to create a simple binary patch algorithm:
bitwise XOR of the old and new versions of a file. This is fast, easy to
implement, and, best of all, generates a patch roughly the size of the
larger of the two versions of the file -- potentially a 50% savings over
the current binary patch format.

More clever diff algorithms can be easily supported. For example,
swbdiff[*] can generate a diff for a file where a kilobyte of data is
added to the middle of a 2K file using the following patch:

(0, 1024, 0, 1024, 0)
	The first 1K of the new file is the same as the old file (the
	single zero gets extended to the full 1024-byte range).
(0, 0, 1024, 1024, <new-file-bits>)
	The second 1K of the new file is new data, encoded directly (the
	region in the old file gets zero-extended and XORed against the
	bits).
(1024, 1024, 2048, 1024, 0)
	The third 1K of the new file is a copy of the second 1K region
	of the old file.

Real diff algorithms should be able to use this format: My impression is
that xdelta generates diffs with formats similar to the previous example
(although not necessarily invertible), and while bsdiff uses addition
and subtraction instead of XOR, I would expect that a substantially
similar algorthm using XOR would have equivalent performance.

Does this all fly? Presuming I'm not completely out in left field, it
seems like integrating the format into darcs promptly would be useful,
just to ease the backward-compatibility burden. The bleeding-edge
version of darcs could start generating full-body XORs soon thereafter,
and people could start experimenting with other diff algorithms at their
convenience. It might even be useful to add a "--binary-diffs" flag to
"darcs optimize" someday, especially for those maintaining public darcs
repositories.

What do people think?

Thanks,
Bill

[*] Super-whiz-bang diff, of course.