[darcs-users] RDF metadata for patch files
drewp at bigasterisk.com
Wed Mar 25 07:46:11 UTC 2009
Hi- I'm a big fan of RDF, and I use it on all sorts of projects. Here
are my opinions on Max's questions, plus some more strawmen to keep the
Formats and RDF store capabilities:
Output RDF/XML for many of the same reasons the other darcs commands
have an --xml flag. I think N3 may not be needed at all, if you like the
property-only approach I recommend below.
Don't bother with SPARQL or a full triplestore unless you have an
especially great library to link in. In practice, people like me will
just want darcs to emit its RDF data (for one patch or for all of them)
so we can transfer it to our external store of choice, e.g. Sesame. Over
in *that* store, I'll probably have loaded in other related data too.
That store becomes responsible for executing queries quickly, etc. To
keep the data fresh, I might try to use some darcs hook to run my
resync-data-for-this-patch tool. Hopefully there is a suitable hook that
fires my program whenever a patch's metadata changes.
I think it would be elegant to design the RDF system as a superset of
the current metadata system, which means 'author', 'comment', 'name',
and 'date' should be mapped to some RDF terms. Phase 1 would be to make
'darcs cha --rdf' emit an RDF/XML document that contains the same data
as the current --xml output. A big part of phase 1 will be to come up
with the URI for a patch (see below).
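To make phase 1 concrete, here is a rough sketch of the mapping in Python, working from XML shaped like the current --xml output. Several things here are assumptions made for illustration, not anything darcs defines: the sample input, the choice of Dublin Core terms (dc:creator, dc:date, dc:title, dc:description) for 'author', 'date', 'name', and 'comment', and especially the 'urn:x-darcs-patch:' URI prefix, which is only a placeholder until the patch URI question is settled.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical placeholder; the real patch URI scheme is still undecided.
PATCH_URI_PREFIX = "urn:x-darcs-patch:"

# Assumed sample input, modeled on the shape of 'darcs cha --xml' output.
SAMPLE_CHANGES_XML = """<changelog>
<patch author='someone@example.com' date='20090325074611'
       hash='20090325074611-abcde-0123456789abcdef.gz'>
  <name>add rdf output</name>
  <comment>first cut</comment>
</patch>
</changelog>"""


def changes_xml_to_rdf(changes_xml: str) -> str:
    """Re-express each <patch> element as an rdf:Description."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    for patch in ET.fromstring(changes_xml).iter("patch"):
        subject = PATCH_URI_PREFIX + patch.get("hash")
        desc = ET.SubElement(
            root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": subject}
        )
        # Assumed mapping from darcs metadata fields to Dublin Core terms.
        fields = {
            "creator": patch.get("author"),
            "date": patch.get("date"),
            "title": patch.findtext("name"),
            "description": patch.findtext("comment"),
        }
        for term, value in fields.items():
            if value:
                ET.SubElement(desc, f"{{{DC_NS}}}{term}").text = value.strip()
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(changes_xml_to_rdf(SAMPLE_CHANGES_XML))
```

The output carries the same four fields as the input, only hung off a subject URI, which is the whole point of phase 1.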
Then, for phase 2, 'darcs add' could gain some general-purpose flags
that let the user submit arbitrary additional edges off the patch URI.
Here's where we can connect the patch to a license, a bug ticket, hours
spent, see-also links, calendar events, etc. I think accepting an
arbitrary RDF graph with any amount of new structure may overwhelm
users, but implementing flags like --license sounds too limited.
Again, all of these proposals are for discussion purposes only, but the
CLI could look like this:
darcs rec --property_literal dc:language "en" \
--property_uri rdfs:seeAlso http://company.com/docs/feature1 \
'--author' becomes a synonym for '--property_literal dc:creator', etc.
I'm imagining that darcs would know a fixed set of prefixes (like 'dc')
for convenience, but that it would still be able to accept arbitrary
URIs for the predicate (aka 'property' aka 'edge label').
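The prefix handling could be as small as the following Python sketch. The function name, the contents of the prefix table, and the error behavior are all hypothetical; the only idea taken from the proposal is that a short fixed set of prefixes gets expanded while full URIs pass through untouched.

```python
# Hypothetical built-in prefix table; a real one would likely be longer.
KNOWN_PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}


def expand_predicate(term: str) -> str:
    """Turn 'dc:language' into a full URI; leave full URIs alone."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in KNOWN_PREFIXES:
        return KNOWN_PREFIXES[prefix] + local
    if sep and prefix:
        # Unknown prefix: treat the whole term as an absolute URI
        # whose scheme happens to come before the colon.
        return term
    raise ValueError(f"predicate is neither prefixed nor a URI: {term!r}")


if __name__ == "__main__":
    print(expand_predicate("dc:language"))
    print(expand_predicate("http://example.com/ns#fixes"))
```

One design consequence worth noting: a typo in a prefix (say 'dcc:creator') is silently accepted as a URI with scheme 'dcc', so a real implementation might want to warn about unknown short prefixes.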
As to per-file metadata in darcs, I think that's not necessary. There
are already ways to embed the file metadata in the file in many cases.
You can also split your patch into a few pieces (and then combine them
in a tag, perhaps) if you need the extra granularity.
In the output graph, what is the subject (aka 'source') of an edge like
dc:creator? RDF wants this to be a URI; darcs already has its hash
codes. Here are some possibilities:
[A] is automatic, and seems like the Simplest Thing that Could Possibly
Work. darcs.org may get pounded with futile requests to resolve the
[C] is a special case of [B] (plus .gz at the end), and it's cool
because the URLs would be resolvable. That's a desirable property of RDF
URIs, though never a requirement.
More use cases:
How do I link a bug ticket with a darcs patch that fixes it? There are
many ad-hoc schemes that involve putting the link id into the patch
comment text, but I think the problems there are obvious. I'd like to
say "this patch fixes bug http://mycompany.com/jira/FOO-345" and then in
another UI, be able to jump from that bug to the darcsweb display of the
I'd love to have an hours-spent value on my patches. Suppose I got my
IDE^H^H^H editor and shells to watch how long I was active on which
project, and the output was available to my 'darcs rec' wrapper. This
would be awesome data to stick in the repo.
On my web project, I have a lot of patches that say "implemented feature
Z, see http://theproject.com/demo/of/Z for an example". Someday I might
add a feature where if you're in admin mode, you can jump from a page on
the site back to the list of tickets (and therefore bugs and
discussions) that were involved in that page.
I already use URIs for my tag names, so that other systems (e.g. a
release notes generator) could make more statements about the tags. I'd
be happy for darcs to be making its own URIs for those tags so I can use
the comment for free-form text again. As with all RDF, this is not
impossible or even difficult to do with tag names or their hash ids, but
it's easier to deal with tons of data sources when they're all in the
same address space (URIs).
It might be cool to link a patch to the results of a test suite that ran
on that code. This would help us make UIs that let you jump from the
point when the tests started taking too long back to the tags just
before and after that event.
There is nothing too hard about writing my own RDF and sticking it at
the end of each darcs comment as XML. It would be easy to find and parse
such a thing back into a triple store. One could even use XSLT to
convert the output of 'darcs cha --xml' into RDF that would mesh with
the new statements inside the comments. So, it seems that this proposal
is more about formalizing the current metadata in a standard way and
less about offering a way to store license data in darcs.
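As a rough illustration of the do-it-yourself route, this Python sketch pulls an RDF/XML block off the end of a comment and lists its statements. It uses plain Python rather than XSLT, it only understands the simplest RDF/XML shape (rdf:Description elements with one property element per statement), and the sample comment, the function names, and the 'urn:x-darcs-patch:example' subject are all made up for the example.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Made-up comment text with a hand-written RDF/XML block at the end.
SAMPLE_COMMENT = """implemented feature Z

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description rdf:about="urn:x-darcs-patch:example">
    <rdfs:seeAlso rdf:resource="http://theproject.com/demo/of/Z"/>
  </rdf:Description>
</rdf:RDF>
"""


def split_comment(comment):
    """Return (free_text, rdf_root_or_None) for a patch comment."""
    start = comment.rfind("<rdf:RDF")
    if start == -1:
        return comment, None
    try:
        graph = ET.fromstring(comment[start:])
    except ET.ParseError:
        # Not well-formed XML; treat the whole comment as plain text.
        return comment, None
    return comment[:start].rstrip(), graph


def triples(graph):
    """Yield (subject, predicate, object) for simple property elements."""
    for desc in graph.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            if not prop.tag.startswith("{"):
                continue  # skip properties without a namespace
            # ElementTree tags look like '{namespace}local'.
            predicate = prop.tag[1:].replace("}", "", 1)
            obj = prop.get(f"{{{RDF_NS}}}resource") or (prop.text or "").strip()
            yield subject, predicate, obj


if __name__ == "__main__":
    text, graph = split_comment(SAMPLE_COMMENT)
    print(text)
    for triple in triples(graph):
        print(triple)
```

For anything beyond this toy shape, a real RDF/XML parser would be the better tool, since RDF/XML allows many equivalent spellings of the same graph.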
The Inkscape/SVG example is nice, but a more convincing demonstration of
'critical mass' is PDF. Many PDF files (especially ones from Acrobat, I
think) have RDF/XML embedded in them.
It might help to never say the words 'semantic web', since like many RDF
applications out there, this has nothing to do with semantics (beyond
basic stuff, e.g. how darcs has a meaning for the term 'author'). It
also doesn't involve any kind of web until users want to connect their
data sources together. That will work really well, but it's not
necessarily one of the goals of RDF-in-darcs.