[darcs-users] RDF metadata for patch files

Drew Perttula drewp at bigasterisk.com
Wed Mar 25 07:46:11 UTC 2009


Hi- I'm a big fan of RDF, and I use it on all sorts of projects. Here 
are my opinions on Max's questions, plus some more strawmen to keep the 
discussion going:

Formats and RDF store capabilities:

Output RDF/XML for many of the same reasons the other darcs commands 
have an --xml flag. I think N3 may not be needed at all, if you like the 
property-only approach I recommend below.

Don't bother with SPARQL or a full triplestore unless you have an 
especially great library to link in. In practice, people like me will 
just want darcs to emit its RDF data (for one patch or for all of them) 
so we can transfer it to our external store of choice, e.g. Sesame. Over 
in *that* store, I'll probably have loaded in other related data too. 
That store becomes responsible for executing queries quickly, etc. To 
keep the data fresh, I might try to use some darcs hook to run my 
resync-data-for-this-patch tool. Hopefully there is a suitable hook that 
fires my program whenever a patch's metadata changes.


Patch properties:

I think it would be elegant to design the RDF system as a superset of 
the current metadata system, which means 'author', 'comment', 'name', 
and 'date' should be mapped to some RDF terms. Phase 1 would be to make 
'darcs cha --rdf' emit an RDF/XML document that contains the same data 
as the current --xml output. A big part of phase 1 will be to come up 
with the URI for a patch (see below).
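
To make phase 1 concrete, here is a rough Python sketch of what the emitted document could look like. Only the author -> dc:creator mapping comes from this proposal; the other Dublin Core terms, the function name, and the example URI are my own assumptions for illustration, not anything darcs does today.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Assumed mapping from darcs metadata fields to Dublin Core terms.
# Only author -> creator is suggested in this thread; the rest are guesses.
FIELD_TO_TERM = {
    "author": "creator",
    "name": "title",
    "comment": "description",
    "date": "date",
}

def patch_to_rdfxml(patch_uri, metadata):
    """Serialize one patch's metadata as an RDF/XML string (hypothetical helper)."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    # The patch URI is the subject of every statement in this document.
    desc = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": patch_uri}
    )
    for field, term in FIELD_TO_TERM.items():
        if field in metadata:
            ET.SubElement(desc, f"{{{DC_NS}}}{term}").text = metadata[field]
    return ET.tostring(root, encoding="unicode")

print(patch_to_rdfxml(
    "http://example.org/patch/abc",
    {"author": "drewp at bigasterisk.com", "name": "fix typo"},
))
```

Fields missing from the metadata are simply skipped, so the output never carries more than the current --xml output does.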

Then, for phase 2, 'darcs rec' could gain some general-purpose flags 
that let the user submit arbitrary additional edges off the patch URI. 
Here's where we can connect the patch to a license, a bug ticket, hours 
spent, see-also links, calendar events, etc. I think accepting an 
arbitrary RDF graph with any amount of new structure may overwhelm 
users, but implementing flags like --license sounds too limited.

Again, all of these proposals are for discussion purposes only, but the 
CLI could look like this:

darcs rec --property_literal dc:language "en" \
           --property_uri rdfs:seeAlso http://company.com/docs/feature1 \
           --property_uri rdfs:seeAlso http://company.com/docs/deprecationStandard

'--author' becomes a synonym for '--property_literal dc:creator', etc. 
I'm imagining that darcs would know a fixed set of prefixes (like 'dc') 
for convenience, but that it would still be able to accept arbitrary 
URIs for the predicate (aka 'property' aka 'edge label').
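
The prefix handling could be as small as the following Python sketch. The prefix table and the function name are assumptions of mine; which prefixes darcs would actually ship with is open for discussion.

```python
# Hypothetical built-in prefix table; the real set would be decided by darcs.
KNOWN_PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}

def expand_predicate(term):
    """Expand 'prefix:local' via the known prefixes; pass full URIs through."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in KNOWN_PREFIXES:
        return KNOWN_PREFIXES[prefix] + local
    if "://" in term:
        # Already looks like a full URI, so accept it unchanged.
        return term
    raise ValueError(f"unknown prefix or malformed URI: {term!r}")

print(expand_predicate("dc:creator"))
print(expand_predicate("http://example.org/vocab/hoursSpent"))
```

Rejecting unknown prefixes outright, rather than guessing, keeps a typo like 'dcc:creator' from silently ending up in the repo as a bogus predicate.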


Per-file metadata:

As to per-file metadata in darcs, I think that's not necessary. There 
are already ways to embed the file metadata in the file in many cases. 
You can also split your patch into a few pieces (and then combine them 
in a tag, perhaps) if you need the extra granularity.


Patch URIs:

In the output graph, what is the subject (aka 'source') of an edge like 
dc:creator? RDF wants this to be a URI; darcs already has its hash 
codes. Here are some possibilities:

A: http://darcs.org/patch/20090323070000-ecde5-cd5fdd37119bcd748942a0bf3d346d1d8da2a9f9
B: http://some-url-root-you-entered.com/some/path/20090323070000-ecde5-cd5fdd37119bcd748942a0bf3d346d1d8da2a9f9
C: http://your-darcsweb-repo.com/darcs/?r=projname;a=commit;h=20090228090242-312f9-c37d395e337108a7a224650414bc18a58e263481.gz

[A] is automatic, and seems like the Simplest Thing that Could Possibly 
Work. darcs.org may get pounded with futile requests to resolve the 
URIs, though.

[C] is a special case of [B] (plus .gz at the end), and it's cool 
because the URLs would be resolvable. That's a desirable property of RDF 
URIs, though never a requirement.
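
Options [A] and [B] differ only in the base, so minting the URI might look like this Python sketch. The function name and the idea of a per-repo configurable base are assumptions; the default base is just the one from example [A].

```python
def patch_uri(patch_hash, base="http://darcs.org/patch/"):
    """Build a patch URI from a base and the patch's existing hash.

    With the default base this is option [A]; passing a base the user
    configured gives option [B]. Hypothetical helper, not a darcs API.
    """
    if not base.endswith("/"):
        base += "/"
    return base + patch_hash

h = "20090323070000-ecde5-cd5fdd37119bcd748942a0bf3d346d1d8da2a9f9"
print(patch_uri(h))
print(patch_uri(h, base="http://some-url-root-you-entered.com/some/path"))
```

Option [C] would need a different template, since the hash goes into a query string rather than a path segment.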


More use cases:

How do I link a bug ticket with a darcs patch that fixes it? There are 
many ad-hoc schemes that involve putting the link id into the patch 
comment text, but I think the problems there are obvious. I'd like to 
say "this patch fixes bug http://mycompany.com/jira/FOO-345" and then in 
another UI, be able to jump from that bug to the darcsweb display of the 
related patch(es).

I'd love to have an hours-spent value on my patches. Suppose I got my 
IDE^H^H^H editor and shells to watch how long I was active on which 
project, and the output was available to my 'darcs rec' wrapper. This 
would be awesome data to stick in the repo.

On my web project, I have a lot of patches that say "implemented feature 
Z, see http://theproject.com/demo/of/Z for an example". Someday I might 
add a feature where if you're in admin mode, you can jump from a page on 
the site back to the list of tickets (and therefore bugs and 
discussions) that were involved in that page.

I already use URIs for my tag names, so that other systems (e.g. a 
release notes generator) could make more statements about the tags. I'd 
be happy for darcs to be making its own URIs for those tags so I can use 
the comment for free-form text again. As with all RDF, this is not 
impossible or even difficult to do with tag names or their hash ids, but 
it's easier to deal with tons of data sources when they're all in the 
same address space (URIs).

It might be cool to link a patch to the results of a test suite that ran 
on that code. This would help us make UIs that let you jump from the 
point when the tests started taking too long back to the tags just 
before and after that event.


Workarounds:

There is nothing too hard about writing my own RDF and sticking it at 
the end of each darcs comment as XML. It would be easy to find and parse 
such a thing back into a triple store. One could even use XSLT to 
convert the output of 'darcs cha --xml' into RDF that would mesh with 
the new statements inside the comments. So, it seems that this proposal 
is more about formalizing the current metadata in a standard way and 
less about offering a way to store license data in darcs.
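
For what it's worth, the find-and-parse half of that workaround could be sketched in Python like this. It assumes the embedded block is plain rdf:Description elements using the literal 'rdf' prefix, which is a convention I'm making up here; a real tool would hand the block to a proper RDF parser.

```python
import re
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Assumes the block in the comment is written with the 'rdf' prefix.
RDF_BLOCK = re.compile(r"<rdf:RDF\b.*?</rdf:RDF>", re.DOTALL)

def extract_triples(comment):
    """Return (subject, predicate, object) tuples found in a patch comment."""
    match = RDF_BLOCK.search(comment)
    if match is None:
        return []
    root = ET.fromstring(match.group(0))
    triples = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # ElementTree tags look like '{namespace}local'; join the two parts.
            predicate = prop.tag[1:].replace("}", "", 1)
            # Object is either a linked resource or the literal text.
            obj = prop.get(f"{{{RDF_NS}}}resource") or (prop.text or "")
            triples.append((subject, predicate, obj))
    return triples
```

The resulting tuples are what I'd then push into the external store.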


Regarding advocacy:

The Inkscape/SVG example is nice, but a more convincing demonstration of 
'critical mass' is PDF. Many PDF files (especially ones from Acrobat, I 
think) have RDF/XML embedded in them. 
http://www.xml.com/pub/a/2004/09/22/xmp.html

It might help to never say the words 'semantic web', since like many RDF 
applications out there, this has nothing to do with semantics (beyond 
basic stuff, e.g. how darcs has a meaning for the term 'author'). It 
also doesn't involve any kind of web until users want to connect their 
data sources together. That will work really well, but it's not 
necessarily one of the goals of RDF-in-darcs.

