[darcs-devel] New Patch Format sticky spots

Sat Apr 22 10:15:07 PDT 2006

On 4/22/06, Tommy Pettersson <ptp at lysator.liu.se> wrote:
> On Thu, Apr 20, 2006 at 10:56:09PM -0700, Jason Dagit wrote:
> > 1) How many bytes do line endings add to the length of the old or new
> > content?  Is it okay to assume line endings are exactly one byte in
> > patches?  I know this will hold in unix-land, but what about win32?
>
> Darcs doesn't do different line endings. \n is a line ending,
> \r\n is a line with a \r as the last char, \r (old Mac) is not a
> line ending. Any conversions will be done by external filters
> when they are implemented.

Alright, in that case I think the way I calculate the length of the
old and new content has a chance of working on all platforms.
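
To make the rule above concrete, here is a minimal sketch using plain
String and the Prelude's lines function rather than darcs's own string
types.  The names contentLength and splitExample are made up for
illustration, and one Char is assumed to stand for one byte:

```haskell
-- Hypothetical helper: byte length of hunk content when every line
-- ends in exactly one '\n'.  Assumes one Char per byte.
contentLength :: [String] -> Int
contentLength = sum . map (\l -> length l + 1)

-- The Prelude's `lines` splits on '\n' only: "\r\n" leaves a trailing
-- '\r' on the line, and a bare '\r' never ends a line.
splitExample :: [String]
splitExample = lines "one\r\ntwo\rstill two\nthree\n"
```

With this rule the computed length matches the raw input length no
matter which platform produced the file, because a '\r' is simply
counted as part of the line's content.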

> > 2) Currently when using darcs interactively (in darcs record for
> > example) what you see on the screen is a dump of what goes into the
> > patch file.  So the direct result of my new patch format is that the
> > patch goes from being easily readable by humans to a bunch of garbage
> > all lumped together.
>
> I think the original thought was to have the patch file format
> be very human readable (and editable/repairable) and use it also
> as screen format, only slightly improved with colors and such,
> so that patches looked the same everywhere. Now efficiency has
> become more important.

I ran a very simple record benchmark of my new code against the status
quo, and the result surprised me.  Recording a 360 MB patch took ~40
minutes with the status quo and ~130 minutes with the new format.  If
I disable the reading of the patch immediately after the record, the
record takes ~6 minutes, so roughly 34 and 124 minutes respectively
are spent just reading the patch back.  It blows my mind sometimes how
unintuitive performance tuning can be with Haskell.  My new hunk
reading code is essentially:

lines (take n s)
  where
  n = length in bytes of either old or new
  s = patch data as a Stringalike

Of course it's a bit different because I use the sal_foo functions,
return the unused portion of s, and had to write my own sal_lines
because I didn't see one (I modeled it after a definition of lines
that I found in the Prelude).  I had thought this would be a really
efficient strategy.  Guess I was wrong.
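
For reference, the strategy described above can be sketched as follows.
This uses plain String and the Prelude's lines instead of the
Stringalike class and the sal_ functions, so it only illustrates the
shape of the approach and says nothing about its performance in darcs.
The name readHunkSide is invented for this example:

```haskell
-- Hypothetical sketch: read one side (old or new) of a hunk.
-- Take the first n bytes of the patch data, split them into lines,
-- and return the unconsumed remainder alongside them.
-- Assumes one Char per byte.
readHunkSide :: Int -> String -> ([String], String)
readHunkSide n s = (lines content, rest)
  where
    (content, rest) = splitAt n s
```

For example, readHunkSide 8 applied to "foo\nbar\nrest of patch" gives
the lines "foo" and "bar" plus the remainder "rest of patch".  Using
splitAt yields the prefix and the remainder from a single call.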

> One option is to try to balance so that
> the format is both efficient and human readable. Otherwise there
> has to be either some conversion, probably from file format to
> screen format, as the coloring already does, but per patch type,
> or the patch interface needs to have two different write
> functions. The latter is probably better, maybe it's possible to
> use a class to default screen write to file write for all old
> patch types?

Alright, if I can get the performance of the patch reading to be
'acceptable' then I'll consider this, as it sounds like a really good
idea.  Unless I can make the patch reading more efficient, though, I
doubt I'll bother.
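
A rough sketch of the class idea Tommy describes, where the screen
writer defaults to the file writer unless a patch type overrides it.
Everything here is hypothetical: the class PatchFormat, its methods,
the AddFile and Hunk types, and both output formats are invented for
illustration and are not darcs's actual interface or patch format:

```haskell
-- Hypothetical class: one writer for the patch file, one for the screen.
class PatchFormat p where
  -- Text written to the patch file.
  showForFile :: p -> String
  -- Text shown interactively; defaults to the file format.
  showForScreen :: p -> String
  showForScreen = showForFile

-- An old-style patch type: the file format is already readable,
-- so it relies on the default screen writer.
data AddFile = AddFile FilePath

instance PatchFormat AddFile where
  showForFile (AddFile f) = "addfile " ++ f

-- A length-prefixed hunk (invented format): line number, old lines,
-- new lines.  It overrides the screen writer to stay readable.
data Hunk = Hunk Int [String] [String]

instance PatchFormat Hunk where
  showForFile (Hunk line old new) =
    "hunk " ++ show line ++ " " ++ show (size old) ++ " "
      ++ show (size new) ++ "\n" ++ unlines old ++ unlines new
    where
      size = length . unlines  -- bytes, counting one '\n' per line
  showForScreen (Hunk line old new) =
    unlines (("hunk " ++ show line) : map ('-' :) old ++ map ('+' :) new)
```

Only the patch types whose file format stops being readable need to
define showForScreen; all the old ones get it for free.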

Thanks,
Jason