A quick response from Bruce, so let's try to get another cycle through before the end of the weekend. More jumbled thoughts follow; I hope they make some sort of sense. This really will have to stagnate for a while now.
Here's take 2. Notes follow.
Page Ranges
Bruce makes two points: the page range should be an attribute of the partOf element, and they should be more general (ranges, not page ranges, in other words).
The page range can't directly be part of partOf; RDF does not allow it. I recommend a scan through at least the first few screensful of the RDF Primer. The RDF data model is graph structured, and RDF/XML is but one serialisation format for it. The nodes in an RDF graph represent "resources", and are of three types: URI, blank, and literal. A URI node means "the resource indicated by this URI", a blank node means "some resource", and a literal is, well, literal. Nodes are connected by arcs, the meaning of which is indicated using URIs. RDF/XML is basically "striped" all the way down, so level 1 elements (inside the rdf:RDF element) represent resources in the RDF model, level 2 elements represent properties of those resources (arcs in the RDF graph), level 3 elements represent the resources which are property values, and so on. The graph structure is encoded in tree-structured XML through the use of rdf:nodeID, rdf:about, rdf:ID, and rdf:resource attributes.
For the sake of discussion, a more keyboard- and eye-friendly format is generally used to represent RDF. The graph is broken out into a set of (subject, predicate, object) triples. Thus
<foaf:Organization rdf:nodeID="iwap">
<dc:title>IWA Publishing</dc:title>
<!-- foaf:mbox, foaf:mbox_sha1sum, and foaf:homepage uniquely
identify that which it is a property of -->
<foaf:homepage rdf:resource="http://www.iwapublishing.com/" />
</foaf:Organization>
can be written
_:iwap rdf:type foaf:Organization .
_:iwap dc:title "IWA Publishing" .
_:iwap foaf:homepage <http://www.iwapublishing.com/> .
Note that the element named foaf:Organization in the XML generates a blank node of rdf:type foaf:Organization. In RDF/XML, the rdf:nodeID attribute is used to give a blank node a name within the scope of the XML document so that the full graph can be built up; in the triple form an _: prefix indicates a local name for a blank node in the same way, hence rdf:nodeID="iwap" becomes _:iwap.
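To make the triple view concrete, here is a minimal Python sketch that holds the IWA Publishing example as a set of (subject, predicate, object) tuples. The helper names objects and describe are mine, not part of any RDF library, and treating URIs and literals alike as plain strings is a simplification that a real RDF toolkit would not make.

```python
# A graph is just a set of (subject, predicate, object) triples.
# Blank nodes carry the "_:" prefix, as in the triple notation above.
# Simplification: URIs and literals are both plain strings here.
graph = {
    ("_:iwap", "rdf:type", "foaf:Organization"),
    ("_:iwap", "dc:title", "IWA Publishing"),
    ("_:iwap", "foaf:homepage", "http://www.iwapublishing.com/"),
}


def objects(graph, subject, predicate):
    """Return every object linked to `subject` by an arc labelled `predicate`."""
    return sorted(o for (s, p, o) in graph if s == subject and p == predicate)


def describe(graph, subject):
    """Collect all arcs leaving `subject` as a predicate -> objects mapping."""
    description = {}
    for s, p, o in graph:
        if s == subject:
            description.setdefault(p, []).append(o)
    return description


if __name__ == "__main__":
    print(objects(graph, "_:iwap", "rdf:type"))
    print(describe(graph, "_:iwap"))
```

Because the graph is a set, the order in which the triples were written down carries no meaning, which is the same property the RDF model has.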
In the extract quoted by Bruce,
<biblio:Publication rdf:nodeID="oshinfDAbbott">
<dc:title>Discussion of The relevance of Open Source to Hydroinformatics by Hamish Harvey and Dawei Han</dc:title>
<biblio:author rdf:nodeID="abbott" />
<biblio:availableFrom rdf:resource="http://www.iwaponline.com/jh/004/jh0040219.htm" />
<biblio:partOf rdf:nodeID="jhinf5.3" />
<biblio:startPage rdf:datatype="http://www.w3.org/2001/XMLSchema#int">203</biblio:startPage>
<biblio:endPage rdf:datatype="http://www.w3.org/2001/XMLSchema#int">206</biblio:endPage>
<!-- Something more specific than "cites" (such as "is discussion
of") would be better here. Just imagine that publishers could be persuaded
to generate this information. Then when you find an interesting article
you could easily search for later discussion! Then again, publishers could
easily add a forward hyperlink from an article to later discussion now, but
don't necessarily do so. -->
<biblio:cites rdf:nodeID="oshinf" />
</biblio:Publication>
our triple presentation would look like this:
_:oshinfDAbbott rdf:type biblio:Publication .
_:oshinfDAbbott dc:title "Discussion of The relevance of Open Source to Hydroinformatics by Hamish Harvey and Dawei Han" .
_:oshinfDAbbott biblio:author _:abbott .
_:oshinfDAbbott biblio:availableFrom <http://www.iwaponline.com/jh/004/jh0040219.htm> .
_:oshinfDAbbott biblio:partOf _:jhinf5.3 .
_:oshinfDAbbott biblio:startPage 203 .
_:oshinfDAbbott biblio:endPage 206 .
_:oshinfDAbbott biblio:cites _:oshinf .
So we have a resource representing the publication of interest, which is connected by a set of arcs to resources which describe or further specify it. Because of the additional semantics of RDF/XML over vanilla XML — namely that the XML specifies an RDF graph — we can't add content inside property elements. While it is possible to say things about types of arc
biblio:partOf biblio:startPage 203 .
what this actually says is "the property biblio:partOf has a biblio:startPage of 203", not "this instance of the property …".
To someone used to thinking in XML terms this may seem to be a limitation, but in fact it isn't that important. An RDF resource can be declared to be of any number of types; the obvious solution here is to declare our publication, in this case Mike Abbott's discussion of my paper, to be of rdf:type biblio:Part:
_:oshinfDAbbott rdf:type biblio:Part .
We could even declare biblio:Part to be a subclass of biblio:Publication and avoid needing to specify both. If a biblio:Part can have a biblio:partSpec property, the value of which can be a biblio:Range, and biblio:Range encapsulates the general notion of parts of a monograph then I think we solve the second objection, too (which is a good one: I am exposing the fact that I have a lot of the same prejudices that the designers of BibTeX had).
_:oshinfDAbbott biblio:partSpec _:abDRange .
_:abDRange rdf:type biblio:Range .
_:abDRange biblio:start 203 .
_:abDRange biblio:end 206 .
Those starts and ends still need to be generalised. I wonder if anyone has done any work on a "parts of things" vocabulary?
This model seems quite reasonable to me; the page range is after all a property of the document, and not just a feature of its part-ness.
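As a rough illustration of how a consumer might read this structure, the following Python sketch follows the biblio:partSpec arc from the publication to its range node and reads off the two ends. The function names value and part_range are hypothetical, and the graph is again just a set of tuples rather than anything produced by a real RDF parser.

```python
# The partSpec/Range triples from the text, as plain tuples.
graph = {
    ("_:oshinfDAbbott", "rdf:type", "biblio:Part"),
    ("_:oshinfDAbbott", "biblio:partSpec", "_:abDRange"),
    ("_:abDRange", "rdf:type", "biblio:Range"),
    ("_:abDRange", "biblio:start", 203),
    ("_:abDRange", "biblio:end", 206),
}


def value(graph, subject, predicate):
    """Return one object for (subject, predicate), or None if there is none."""
    matches = [o for (s, p, o) in graph if s == subject and p == predicate]
    return matches[0] if matches else None


def part_range(graph, part):
    """Follow biblio:partSpec to the range node and read its start and end."""
    range_node = value(graph, part, "biblio:partSpec")
    if range_node is None:
        return None
    return (
        value(graph, range_node, "biblio:start"),
        value(graph, range_node, "biblio:end"),
    )


if __name__ == "__main__":
    print(part_range(graph, "_:oshinfDAbbott"))  # (203, 206)
```

Nothing in part_range knows that the range is a page range, which is the point of the generalisation: the same two-hop lookup would work for any other kind of start and end.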
Names with roles
I've switched to using dc:creator for this. "Authorship" can be assumed if the resource is textual. Editorship needs to be explicitly noted. The approach to expressing the data in MODS is not entirely applicable in RDF, where the node representing a person simply represents that person, independent of their involvement in any relationships within the RDF graph. Thus the relationship between a citable resource and a person must be specified in the arc type. Properties can be subproperties of others, so for example biblio:editor could be a subproperty of dc:contributor.
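The effect of declaring one property a subproperty of another can be sketched in a few lines of Python. This is only an illustration of the subproperty entailment, not a reasoner: the mapping SUBPROPERTY_OF, the function with_superproperties, and the node names _:someBook and _:somePerson are all made up for the example.

```python
# Hypothetical schema: biblio:editor is a subproperty of dc:contributor.
SUBPROPERTY_OF = {
    "biblio:editor": "dc:contributor",
}

# Hypothetical data: some book has some person as its editor.
graph = {
    ("_:someBook", "biblio:editor", "_:somePerson"),
}


def with_superproperties(graph, subproperty_of):
    """Add (s, q, o) for every (s, p, o) where p is a subproperty of q.

    Repeats until nothing new is added, so chains of subproperties
    are followed all the way up.
    """
    inferred = set(graph)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(inferred):
            q = subproperty_of.get(p)
            if q is not None and (s, q, o) not in inferred:
                inferred.add((s, q, o))
                changed = True
    return inferred


if __name__ == "__main__":
    for triple in sorted(with_superproperties(graph, SUBPROPERTY_OF)):
        print(triple)
```

A tool that only understands dc:contributor would then still find the editor, while one that knows about biblio:editor keeps the more specific role.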
The core of my thinking here is that authors and such are people, not names. RDF allows us to express that quite nicely. It might be necessary to use something like
_:foucaultM rdf:type foaf:Person .
_:foucaultM biblio:personID
As with foaf:mbox and foaf:homepage, biblio:personID can be defined as an inverse functional property in OWL, which means that any two resources with the same biblio:personID value are the same person.
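What an inverse functional property buys us can be shown with a small Python sketch that groups nodes sharing an identifier value. The identifier values "example-id-1" and "example-id-2" and the nodes _:author17 and _:someoneElse are placeholders invented for the illustration, and same_individuals is a hypothetical helper, not an OWL reasoner.

```python
from collections import defaultdict

# Placeholder data: two nodes share one identifier, a third has another.
graph = {
    ("_:foucaultM", "rdf:type", "foaf:Person"),
    ("_:foucaultM", "biblio:personID", "example-id-1"),
    ("_:author17", "biblio:personID", "example-id-1"),
    ("_:someoneElse", "biblio:personID", "example-id-2"),
}


def same_individuals(graph, inverse_functional_property):
    """Group subjects that share a value for an inverse functional property."""
    by_value = defaultdict(set)
    for s, p, o in graph:
        if p == inverse_functional_property:
            by_value[o].add(s)
    # Only groups with more than one node tell us anything new.
    return [sorted(nodes) for nodes in by_value.values() if len(nodes) > 1]


if __name__ == "__main__":
    print(same_individuals(graph, "biblio:personID"))
    # [['_:author17', '_:foucaultM']]
```

Records from different sources can therefore use their own local blank node names for a person and still be merged, provided they agree on the identifier.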
This still leaves the question of how one might record the name used for an author on a particular publication. Together with the need to record author precedence (RDF semantics discard any ordering in the RDF/XML) this suggests a slightly more complex structure is required here.
Origin info
I am unconvinced by the collation of publisher and place. I think this is an artefact of how information is used in citation, rather than a reflection of the structure of reality (whatever that is …). Are the address of a publisher and the place where a speech is delivered even structurally similar in terms of citation?
I hope (but don't know) that using for example OWL, it would be possible to specify that if a resource has a publisher, and the publisher has an address, then the place of publication of the resource is the address of the publisher. With for example a speech, place information could be attached directly; it is a property of the speech. The location of publication is a property of the publisher, not of the published item.
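Whether OWL can state such a rule is left open here, but what the rule would compute is easy to sketch in Python. The property names biblio:address and biblio:placeOfPublication and the node _:iwapAddress are assumptions made up for the example; only dc:publisher is an existing term.

```python
# Assumed data: an issue has a publisher, and the publisher has an address.
graph = {
    ("_:jhinf5.3", "dc:publisher", "_:iwap"),
    ("_:iwap", "biblio:address", "_:iwapAddress"),
}


def infer_place_of_publication(graph):
    """If X has publisher P and P has address A, add X placeOfPublication A."""
    inferred = set(graph)
    for resource, p1, publisher in graph:
        if p1 != "dc:publisher":
            continue
        for s2, p2, address in graph:
            if s2 == publisher and p2 == "biblio:address":
                inferred.add((resource, "biblio:placeOfPublication", address))
    return inferred


if __name__ == "__main__":
    for triple in sorted(infer_place_of_publication(graph)):
        print(triple)
```

The place of publication is derived rather than stored, so the address stays a property of the publisher and never needs to be repeated on each published item.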
Because RDF is trying to model resources, not just data, I'm not keen on inserting spurious indirection. The place at which a speech was delivered is the value of a property of the speech, not of some originInfo resource, which itself is the value of a property of the speech. I know that Shelley Powers for example thinks differently, however.
Related items and parts
The need for a more elaborate model of part-ship is acknowledged; more thought is needed here. RDF is all about related items. The type of the relationship is indicated by the choice of URI used to represent it, which in turn is expressed in RDF/XML as the element name (remember that the namespace prefix expands to the full URI indicated in the rdf:RDF start tag). Because RDF/XML is expressing an RDF graph in a very specific way, the elements cannot be given arbitrary semantics. Incidentally, I believe this is a good thing, and touched on why in my previous post.
A possible source of communication difficulty here is that I see part-ship being modelled as a relationship between two fully fledged resources. In the MODS approach the contained related item is subordinate to the primary record. In the particular case that a resource only exists to be indicated as a related item of another, standard RDF/XML allows it to be shown in this way, but the RDF graph which results from deserialisation is not altered by this. This is another (and related) interesting aspect of RDF: the data model and the serialisation are less closely coupled than when using XML.
As regards capturing volume and issue numbers as part of the part relation, in elevating an issue of a serial to the status of a resource in its own right I have done something along these lines. Again, I would say that the volume and issue number are properties of the issue (I suspect that inserting a "volume" entity would be overkill).
Other changes
- The dcterms namespace added so we can use qualified DC terms.
- Property biblio:partOf changed to dcterms:isPartOf.
- Properties biblio:year and biblio:month changed to either dcterms:issued or dcterms:created depending on whether the resource was formally published or not. For the time being I have used a truncated form of standard XML dates giving only the year and month, which I'm pretty sure is invalid.
- Property biblio:cites changed to dcterms:references.
- Property biblio:author replaced with dc:creator. "Author" doesn't add any information in this context; the creator of a written work is an author by definition.
- Added biblio:genre property (in the record for Michael Crichton's speech). This could take values from any existing authority, so long as they were available as URIs.
- Made that speech record a blank node, and used dcterms:location to identify the location of the transcript. Could this be modelled better?
- Added a biblio:place property to the speech. The semantics of this need thinking about.