[darcs-users] patch metadata, annotations, Ignore-this, tagging, etc

Jason Dagit dagitj at gmail.com
Tue Mar 23 04:06:07 UTC 2010


On Mon, Mar 22, 2010 at 12:34 PM, Max Battcher <me at worldmaker.net> wrote:

> On 3/22/2010 14:34, Jason Dagit wrote:
>
>>
>>
>> On Mon, Mar 22, 2010 at 11:02 AM, Max Battcher <me at worldmaker.net
>> <mailto:me at worldmaker.net>> wrote:
>>
>>    Long term I'd like a pony, but more importantly for darcs patches to
>>    be in some easy-to-parse markup format like JSON, perhaps.
>>
>>
>> What is the concern you're addressing?  What if we made the darcs patch
>> parser callable from C?  Would that be good enough?  Why do you want the
>> patches to be easy to parse in a markup format?
>>
>
> Like I said, it is a wish for a pony. I don't have any specific need in
> mind, but "wouldn't it be nice if...". Given a long term format change, I
> would always prefer something standardized and well known over something
> proprietary and possibly prone to break in unexpected ways. Certainly if a
> standard markup format were already in use by darcs we wouldn't have as many
> problems adding metadata or changing patch formats to meet the needs of
> today.
>
> So, be it JSON or YAML or Binary XML or Google Protocol Buffers or
> something else I haven't considered, it doesn't really matter: the intent is
> that it should be something usefully extensible with known efficient parsers
> and known operating requirements. Which is to say that I appreciate your
> arguments for efficiency, Jason, but precisely because of those arguments
> I've come to strongly appreciate well known parsers over hand-built ones,
> because I know the "operating efficiencies"...
>

I have too many side projects for the amount of time I give them, but one
idea that keeps coming back up in my brain is to use criterion/progression
to benchmark various parsers for the current darcs format.  I was thinking of
pitting attoparsec vs. darcs source vs. attoparsec-iteratee vs. pure
iteratee vs. database backend vs. ??.  Comparing memory usage would be in
there too, but I don't think criterion has a way to do that yet.
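
The harness needn't be fancy to begin with.  A stdlib-only sketch of the idea
(the two "parsers" here are placeholders, not real darcs parsers, and
criterion would replace the crude one-shot timing with proper sampling):

```haskell
import Data.List (foldl')
import System.CPUTime (getCPUTime)
import Text.Printf (printf)

-- Two placeholder "parsers" over the same input; stand-ins for the real
-- candidates (attoparsec, the darcs hand-rolled parser, iteratee, ...).
parseA, parseB :: String -> Int
parseA = length . filter (elem ':') . lines
parseB = foldl' (\n l -> if ':' `elem` l then n + 1 else n) 0 . lines

-- Crude CPU timing; criterion would do this properly, with statistics.
timeIt :: String -> (String -> Int) -> String -> IO Int
timeIt name f input = do
  start <- getCPUTime
  let r = f input
  r `seq` return ()            -- force the result before stopping the clock
  end <- getCPUTime
  printf "%-8s %.3f ms\n" name (fromIntegral (end - start) / 1e9 :: Double)
  return r

main :: IO ()
main = do
  let input = unlines (replicate 100000 "Ignore-this: deadbeef")
  a <- timeIt "parseA" parseA input
  b <- timeIt "parseB" parseB input
  print (a == b)               -- sanity check: strategies must agree
```

The important part is the last line: whatever we benchmark, the candidates
have to be checked against each other for agreement, not just for speed.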

Would that, along with some asymptotic memory/time analyses, satisfy your
craving?  I ask because it seems like knowing a particular parser/format
works well enough for general purpose usage isn't as good as having evidence
that it works well on a specific specialized task.


> (As in, I know the relative strengths and weaknesses of the various XML
> parsers at my disposal in Python or C#. I know which ones call C backing
> libraries, and I know which ones I'd pick for ease of use and which ones for
> power and which ones for optimal speed/memory.


Keep in mind, with modern Haskell compilers and libraries we should be able
to produce faster code than C if needed.  Don Stewart has demonstrated this
quite a few times :)


> I can choose one to use based on the requirements of the current project.
> Same for YAML or JSON... But each and every "special" or "proprietary"
> parser brings its own learning curve.)


Which one would you pick for a YAML patch format?  Suppose Haskell isn't a
consideration.


>
>
>>    When Ignore-this was first implemented the medium term solution of
>>    using a full RFC822 email-like header was broached. Of course,
>>    RFC822 is full of loopholes and surprisingly hard to parse in
>>    reality, but the obvious point that Ignore-this: xxx does indeed
>>    look like an email header still stands. (I'd like to remain on the
>>    record that I'd still prefer a better name like "Patch conflict
>>    avoidance hash" than Ignore-this, by the way.)
>>
>>
>> Yeah.  I think that's fair.  Are there no parsers for RFC822 on
>> Hackage?  I see this:
>>
>> http://hackage.haskell.org/packages/archive/mime/0.3.2/doc/html/Codec-MIME-Parse.html
>>
>> Does that provide the type of parser you're looking for?
>>
>
> RFC822 is an ugly standard to parse: headers end at the first empty line,
> except in the case when a malformed gateway adds extra spaces everywhere, in
> which case it might be any invisible line that "seems correct"...  RFC822 is
> still a better standard than the current lack of a standard for Ignore-this
> headers, but not by much.


I'm a little confused by the flow of the conversation here.  Are you
implying that even if we had a tested/robust RFC822 parser in Haskell you'd
rather we didn't use that format?
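
For what it's worth, the core rule (headers end at the first empty line) is
simple in isolation.  A deliberately naive sketch, ignoring folded
continuation lines and the malformed-gateway mess you describe:

```haskell
import Data.Char (isSpace)

-- Split a message into (headers, body) at the first empty line and
-- parse "Name: value" pairs.  Naive on purpose: no folded lines,
-- no tolerance for stray whitespace from broken gateways.
splitMessage :: String -> ([(String, String)], String)
splitMessage s =
  let (hdrs, rest) = break null (lines s)
      body = unlines (drop 1 rest)   -- drop the empty separator line
  in (map parseHeader hdrs, body)

parseHeader :: String -> (String, String)
parseHeader l =
  let (name, rest) = break (== ':') l
  in (name, dropWhile isSpace (drop 1 rest))
```

So `splitMessage "Ignore-this: abc\n\npatch body\n"` gives
`([("Ignore-this","abc")], "patch body\n")`.  It's the edge cases, not this
happy path, that make RFC822 ugly, which I take to be your point.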


>
>
>     I've been thinking on this some, and I think I have a reasonable
>>    suggestion that is easier to parse than RFC822, but carries a
>>    similar effect: YAML formatted darcs comments.
>>
>
>> The YAML snippets seem pretty reasonable as long as they don't require
>> the parser to hit an ending tag while parsing the patches themselves
>> (seems reasonable for a short-ish section of headers though).
>>
>
> YAML was designed for streaming, definitely. In particular, even the most
> inefficient parser should respect the explicit end of document marker (...)
> and not need to parse past it before returning results. All of the YAML
> parsers I've seen are generally much more efficient than that, of course,
> and I think the YAML specs make it relatively clear how self-contained and
> easy to parse all of the markup is.


I really need to learn more about YAML.  I've never needed it or studied it.
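Still, the framing rule you describe (stop at the explicit `...`
end-of-document marker without reading past it) seems easy to respect even in
a naive line-based reader.  A sketch of just that framing, not of YAML
parsing itself:

```haskell
-- Take the lines belonging to the current YAML document, stopping at
-- the explicit end-of-document marker "..." and leaving everything
-- after it untouched for the next consumer.
takeDocument :: [String] -> ([String], [String])
takeDocument ls =
  case break (== "...") ls of
    (doc, _marker : rest) -> (doc, rest)  -- marker found; rest unread
    (doc, [])             -> (doc, [])    -- no marker: one big document
```

On lazy input this never forces anything past the marker, which is exactly
the streaming property you're describing.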


>
>
>>  For the
>> patches I really think we want a format that is more amenable to
>> streaming or seeking.  You could imagine it having a "table of contents"
>> section with offsets that can be seek'd to.  I guess strictly speaking
>> that is doable in an XML schema, but perhaps uncommon.
>>
>
> Seeking probably would be a good property to include on the list of
> features to prefer when searching for a new long term patch format.
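
To make the table-of-contents idea concrete, a hedged sketch (the names and
layout here are invented, not a proposal): store (offset, length) pairs up
front and seek to a patch on demand:

```haskell
import System.IO
import qualified Data.Map as Map

-- Hypothetical table of contents: patch name -> (byte offset, length).
type Toc = Map.Map String (Integer, Int)

-- Read one patch's bytes without touching the rest of the file.
-- Naive: assumes the offset/length are in bounds.
readPatch :: FilePath -> Toc -> String -> IO (Maybe String)
readPatch path toc name =
  case Map.lookup name toc of
    Nothing         -> return Nothing
    Just (off, len) -> withFile path ReadMode $ \h -> do
      hSeek h AbsoluteSeek off
      chunk <- sequence (replicate len (hGetChar h))
      return (Just chunk)
```

With a file containing "abcdef" and a TOC entry ("p1", (2, 3)), this returns
Just "cde" without ever reading the rest of the file.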


Just some musings about a pony format:

Yes, this and keeping as much on disk as possible while inspecting a patch
sequence lead me recently to wonder again about using a 3rd party database
as the storage.  SQLite is easy, but not my favorite (mainly I dislike the
lack of foreign keys and type enforcement; those are merely annoying rather
than show-stoppers, thanks to features like triggers and the fact that we'd
be interacting with SQLite from a typed programming language).

It seems like if we used a relational db we'd be forced to store patch hunks
in the filesystem, but that's probably for the best anyway.  With the hunks
stored on disk separately you'd almost never need to have them in memory (I
think).  I guess maybe the initial diff that created the patch or a replace
patch might require it.  Perhaps some conflicts.  Basically the patch
inventory would be in a table and indexed so that hopefully we'd see good
performance when interacting with it.
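
Purely to illustrate the shape of that (entirely hypothetical table and
column names; not a schema proposal), something along these lines:

```sql
-- One row per patch in the inventory, ordered by position,
-- with the hunk contents left on the filesystem.
CREATE TABLE inventory (
    position  INTEGER PRIMARY KEY,   -- order of the patch in the repo
    hash      TEXT NOT NULL UNIQUE,  -- patch identity hash
    name      TEXT NOT NULL,
    author    TEXT NOT NULL,
    date      TEXT NOT NULL,
    hunk_path TEXT NOT NULL          -- where the patch contents live on disk
);
```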

I expect we'd still need hashed-storage to efficiently query/update the
filesystem and we'd probably also want the filecache (not sure).  So I'm
really only talking about storing the patch metadata in the database.  With
SQLite, I think that means you could have very large histories before you ran
into its limitations.  I think SQLite is also good at only loading as much
into memory as it really needs to (unlike current darcs).  Anyway, I
haven't thought enough about this to have a concrete proposal or even a
draft schema.

I guess it wouldn't have to be a relational db, but Berkeley DB didn't turn
out to work well for the svn folks, so I'd be hesitant to pick that one.  My
understanding was that each release would break binary compatibility with
existing repositories, and it was a headache for users.  I've heard that
SQLite has a much better track record in this regard.  I don't know enough about
hashed-storage to say whether it could be used to store the inventory/patch
metadata.

Jason