Taxonomy, folksonomy, and value

Let's try again.

Shelley Powers has written a good summary of some arguments about the relative merits of tagging and more formal metadata. Nice pics, too.

Shelley is worried that the rise of "cheap" metadata, in particular tags, will inhibit development of tools to make "real" metadata available to the masses.

Theory: the value of terms in any metadata system used in an uncontrolled way tends to zero with time.

That's just as true for rich, formal so-called “ontologies” as it is for tag soup. There is a difference, however. Tags don't lie about their status. We know, right off, that they're just terms which people have attached to things, for some reason. We know that we don't know what sense the term is meant in, of the undoubtedly many possible. We know that we don't know the intention of the tagger in tagging. The spam problem arises from this: spammers are applying tags in order that pictures show up in certain places, not because they think those tags are relevant. The business of tagging flickr photos with the "offensive" tag is another case in point, where the intention is now to stop items showing up in certain places, and really has nothing to do with offensiveness.

Problematic, perhaps, but not unique to tagging. Exactly the same is true of uncontrolled use of any classification system, however good the tools are which provide access to it. Things will be tagged in a way which is inconsistent with the intended use of any system, either through ignorance or intent.

The rel="nofollow" tag which Google has announced it will use to control the inclusion of links in its PageRank calculations is indeed a nice demonstrator of an important point here. Its meaning is very well defined: it is grounded not in human understanding, but in machine behaviour. Nofollow can't be spammed: either Google honours it, or they don't. The web now has a mechanism for systematically withholding page rank. What it does with it is out of Google's control.
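To make "grounded in machine behaviour" concrete, here is a minimal sketch of how a crawler might sort links by whether they carry rel="nofollow". It is emphatically not Google's implementation; the class name `LinkCollector` and the example URLs are made up for illustration, and only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical crawler helper: separates links by rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.followed = []   # links that would count towards ranking
        self.withheld = []   # links whose author asked for no credit

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if href is None:
            return
        # rel is a space-separated list of tokens; compare case-insensitively.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.withheld.append(href)
        else:
            self.followed.append(href)


collector = LinkCollector()
collector.feed(
    '<a href="http://example.org/a">a</a> '
    '<a rel="nofollow" href="http://example.org/b">b</a> '
    '<a rel="external NoFollow" href="http://example.org/c">c</a>'
)
```

The term means exactly what the code does with it, no more and no less, which is what distinguishes it from a tag like "offensive".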

The point is that some terms have meaning in this mechanical way, like those which make up a programming language. Others have meaning in a fluid, human way. The latter will always be as fluid as they are human; no fancy tools will change that.

Fancy tools to make use of formal taxonomies easier would be great, of course. Very useful for those who care enough to put in the effort to understand, and make sure they stick to, the formal intended meanings of terms. But however easy that is made, it will always be hard, and people will always get it wrong. And as people get it wrong, the extensional definition of the term becomes less precise, and the value of the term tends to zero. You might slow the process down, but you can't, simply can't, stop it.

And, of course, because the effort involved in understanding and carefully applying the terms of a formal taxonomy is relatively high — independently of tools, which can reduce the accidental difficulties but cannot attack the essence — it is just not useful in the sort of "fire and forget" systems like del.icio.us, where any effort is too much.

In the comments to Shelley's post, Nick Sweeney writes:

I’m a bit of a Saussurean about this, in that I think that taxonomy (or ontology, depending upon your disciplinary point of origin) is crystallised/calcified folksonomy. Authorised folksonomy, if you like.

Right on! Those two terms, “crystallised” and “calcified” are a really useful duo. The one has connotations of order, beauty, and value. The other of dull rigidity. Well designed formal taxonomies have the first of those in abundance. Both have the second, and you can't avoid it. A taxonomy can only ever be an encoding of one incomplete and imperfect view of the world.

Shelley:

I agree with Clay that the semantic web is going to be built ‘by the people’, but it won’t be built on chaos. In other words, 100 monkeys typing long enough will NOT write Shakespeare; nor will a 100 million people randomly forming associations create the semantic web.

100 monkeys indeed will not write Shakespeare, but 100 Shakespeares didn't write Shakespeare either. Shakespeare did. The situation we have is that of 100 million people (or, at any rate, some large and growing number) who need tools to organise stuff, and to help them find stuff. 100 million people who will go right on randomly forming associations, whether the results pretend to be anything other than randomly formed associations or not.

We're dealing here not with the formal space of so-called ontologies and logic, but with the messy, human space of language and meaning. The processes involved are essentially social. Tools which deny this will fail.

Shelley:

Clay believes that ultimately ontologies will fall to folkonomies, as the latter gain rapid acceptance because of their low cost and ease of use; I believe that ultimately interest in folksonomies will go the way of most memes, in that they’re fun to play with, but eventually we want something that won’t splinter, crack, and stumble the very first day it’s released.

When you design a building to ride out earthquakes with minimal damage, you build in flexibility. Of course, the clever part is in deciding where and how to build in that flexibility. Tagging provides an abundance of flexibility, so much that the building can barely stand up.

But in some ways it works. Delicious is already very useful, on an individual basis, and has some use at a social scale. I follow what other people post about Lisp, Scheme, and Smalltalk for example. These are well defined terms, referring to programming languages rather than highly ambiguous and mis- or differently interpretable concepts. With less well defined concepts, the problems lie as much in the nature of the concepts as in the tagging approach.

Simply aggregating tags a la Technorati as it stands doesn't show a great deal of promise, but it is possible to envisage tools which show tag clouds and, by presenting these to the user, encourage convergence around particular tags. The results will never be complete, never perfect, but always changing, and sometimes useful.

When you have a Web-load of people, things happen from the bottom up. That's the way the Web works.

RDF bibliographies, take 2

A quick response from Bruce, so let's try to get another cycle through before the end of the weekend. More jumbled thought follows; I hope it makes some sort of sense. This will now really have to stagnate for a while.

Here's take 2. Notes follow.

Continue reading "RDF bibliographies, take 2" »

Elaboration vs. Layering

In an ongoing discussion of tuple spaces is embedded a discussion of the nature of HTTP as a transfer, rather than transport, protocol, or as an application protocol, or …

Patrick Logan makes a point that is easier to get hold of:

The difference is this: the HTTP interface is vague and the Linda interface is specific. Linda has precise, simple semantics. The possible range of behaviors exhibited in Linda-based systems benefit from being layered on *top* of those precise, simple semantics. [Patrick Logan: REST and Linda for Distributed Coordination: Elaboration vs. Layering]

This sounds to me very like the difference between XML and RDF as starting points for building data models. RDF provides much more specific semantics, which means that, assuming applications of RDF stick with those semantics, you can get much further with standard tools. Note that I am not in this taking a stance on REST vs. tuple spaces; I don't know enough about either. It's just that the distinction Patrick expresses strikes me as familiar.

Aside: The "assuming applications of RDF stick with those semantics" qualification is something I hadn't thought about until Jan Grant commented that my proposed approach to using RDF in MDF was at odds with RDF semantics. I still need to learn more about those semantics in order to avoid trampling on them.

As a general rule then, a layer should aim to support further layers by being precise, simple, but flexible, not by being vague and requiring each specialisation to make elaborations.

Bibliographic records in RDF

I've had a couple of conversations recently with Bruce D'Arcus, in particular about the tendency for people working on bibliographic software of one sort or another to get stuck on the BibTeX data model. We are agreed that that data model is basically broken; it was never perfect and it has become less so as time has passed.

As an aside before I continue, I once had cause to try to modify a BibTeX style (chicago, which seemed at the time to have a rather nasty formatting bug). It turns out that BibTeX styles are written in a postfix stack language. If I hadn't learned to use an HP reverse Polish calculator as a kid I would have been completely bamboozled by it. I got somewhere with my modifications, but then I came across a script which would ask a series of questions and build a BibTeX style accordingly, so I was released from having to learn a truly obscure language properly.

Bruce's weblog is probably a good place to watch for news on ongoing efforts to develop new ways of encoding bibliographic data. Most of these are developing XML schemas for the purpose, while there is at least one RDF based effort. Unfortunately many of these are doing no more than expressing the old BibTeX data model in XML or RDF.

Bruce himself is a MODS evangelist. I haven't had time to get my head round MODS, and the fact that I feel I need to spend time doing so worries me slightly. From what I can gather, MODS is largely a simplification of the MARC 21 (MAchine-Readable Cataloging) bibliographic format, and is developed by the US Library of Congress. I say largely a simplification because, for example,

Continue reading "Bibliographic records in RDF" »

Data Emergence and not throwing information away

Danny Ayers points out a neat summary by Robin Good of ideas about "Data Emergence".

The phrase "data emergence" really only captures one aspect of the process. Something can emerge only after a critical mass of data has been collected. This creates a tricky catch-22; in order for a database to be "aided through normal, selfish use" (Dan Bricklin, cited by Jon Udell, cited in turn by Robin Good), it is necessary for a database to exist to be used in the first place. The rule is also harder to apply in situations where any "database", such as it is, is distributed.

Which is the sort of situation which occurs in supporting the development of (simulation) models of the physical environment and the associated data processing, about which I was talking this week in Bristol (abstract, slides [PDF, 320Kb], for what it's worth) and about which I will talk again next week in Delft (I'll post the slides as updated for that when I get back, and I need to start writing this into a paper for Hydroinformatics 2004 soon).

In that talk I said, "Don't throw information away." Throwing information away is exactly what happens all the time in model development and application activities at the moment (it happens everywhere, but let's stick with my pet case study). Raw data (for example from remote sensing or ground survey) are processed, and the processed data used for some purpose, but the processing steps and reference to the raw data from which the processed data are derived are discarded (or "not kept", but in this case sins of omission and commission can I think be conflated without concern since it is clear to everyone that this information is, or will be in the future, critical).

Often even the raw data are discarded. I think this is true of at least some of the weather radars operated by the Met. Office here in the UK. In this case it is a hangover from days of yore when storage on that scale was beyond the reach even of a national meteorological office, but it needs fixing fast.

Closer to home, in the last progress meeting of the Next Generation of Flood Inundation Models project, it was observed that some of the data acquired for the project is billed as "geo-referenced", but it is quite unclear what is meant by this and quite unlikely that any reasonably strict definition of geo-referenced could be applied. This example draws attention to the fact that linguistic descriptions of processing steps are still not enough; the resulting descriptions will most likely, if they are made at all, be minimal and questionable to the degree that they are actually content-free.

The frustrating thing here is that these processing steps are almost invariably applied with the aid of software, and that software could, if the appropriate frameworks were in place, keep track of this information without placing additional demands on the user (and so without being ambushed by Doctorow's Metacrap straw men).

Think of a persistent undo facility, where each data set carries its own processing history with it. The undo analogy isn't perfect, since in many cases knowing a forward transformation does not imply an ability to reverse it, and (as was emphasised to me after I made over-optimistic claims regarding the rate of decrease of mass storage costs without allowing for increasing demand) keeping each intermediate step is still prohibitively expensive. It is however plausible that checkpoints could be kept, and intermediate stages could be recreated by following the processing sequence forward from the nearest checkpoint.
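A rough sketch of what I have in mind, with everything (the class name `TrackedDataset`, the idea of representing a processing step as a plain function, the toy steps themselves) invented for illustration rather than taken from any existing framework: the data set records each step as it is applied, keeps a copy of the data only at chosen checkpoints, and recreates any intermediate stage by replaying forward from the nearest earlier checkpoint.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Step = Callable[[List[float]], List[float]]


@dataclass
class TrackedDataset:
    """Hypothetical data set that carries its own processing history."""

    values: List[float]
    history: List[Tuple[str, Step]] = field(default_factory=list)
    checkpoints: Dict[int, List[float]] = field(default_factory=dict)

    def __post_init__(self):
        # Stage 0 is the raw data; it is always kept.
        self.checkpoints.setdefault(0, list(self.values))

    def apply(self, name: str, step: Step, checkpoint: bool = False) -> None:
        """Apply a forward transformation and record it in the trail."""
        self.values = step(self.values)
        self.history.append((name, step))
        if checkpoint:
            self.checkpoints[len(self.history)] = list(self.values)

    def state_at(self, n: int) -> List[float]:
        """Recreate the data as it was after n steps, replaying forward
        from the nearest checkpoint at or before n."""
        if not 0 <= n <= len(self.history):
            raise ValueError("no such stage")
        start = max(k for k in self.checkpoints if k <= n)
        values = list(self.checkpoints[start])
        for _, step in self.history[start:n]:
            values = step(values)
        return values


ds = TrackedDataset([1.0, 2.0, 3.0])
ds.apply("scale x2", lambda v: [x * 2 for x in v])
ds.apply("offset +1", lambda v: [x + 1 for x in v], checkpoint=True)
ds.apply("clip at 6", lambda v: [min(x, 6.0) for x in v])
```

Note that nothing here needs a step to be reversible: "clip at 6" throws information away, but stage 2 can still be recovered because it was checkpointed. A real version would need steps that can be serialised alongside the data, which plain functions cannot.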

These trails should be firmly attached to the data set the derivation of which they describe, so that when someone is handed that data set in twenty years time the trail is still there. This might at first seem to be at odds with Earl Mardle's comment on an earlier metadata-related post of mine.

Precisely. And for it to be worth anything, it must also be held separately from the original data. That way, others can contribute to the development of the metadata or annotation, of the document. [Earl Mardle: Metadata As Web Service]

Of course it isn't at odds really. If (no small if, but let's not get stuck here for now) I can refer to a given data set using a URI, then I can say things about it anywhere I want to, whether I own the data set or not, as can anyone else. I can decide which of the statements other people have made about that data set (those which are made visible to me) I want to make use of. But it is essential that if I have access to the data itself, I have access to information about its provenance, and it makes little sense to do other than trust the supplier of the data to supply that.

What *is* the semantic web?

Clay Shirky's recent rant against the Semantic Web has triggered a raft of responses. Paul Ford's makes some particularly good points.

One of the real sticking points with this is that in order to address the two important questions about any effort like the SemWeb (namely

  1. Is it useful?
  2. Is it possible?

), we need first to answer the more fundamental question

  • What is it?

, and we don't seem to have a good, shared, answer to that. The W3C provide a definition which doesn't help a great deal. Paul proposes an alternative, simpler definition:

The Semantic Web is a framework that rigidly defines a means for creating statements of the form “Subject, Predicate, Object” or “triples,” in a machine-readable format, where each of Subject, Predicate, Object is a URI. [Paul Ford: A Response to Clay Shirky's “The Semantic Web, Syllogism, and Worldview”]

In actual fact, neither of the definitions really helps much in the raging debate, since they essentially avoid controversy by only describing the technology (Paul's definition also presents one view, that of triples, rather than another common "model" of RDF data as a digraph).
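The two views are the same data looked at differently, which a few lines make plain. This is only a sketch: the URIs are made-up example.org ones, and `as_digraph` is a name I have invented, not part of any RDF toolkit. Subjects and objects become nodes; each triple becomes an arc labelled with its predicate.

```python
from collections import defaultdict

# Hypothetical statements, each a (subject, predicate, object) triple of URIs.
triples = [
    ("http://example.org/post/1", "http://example.org/terms/author", "http://example.org/people/a"),
    ("http://example.org/post/1", "http://example.org/terms/about", "http://example.org/topics/rdf"),
    ("http://example.org/people/a", "http://example.org/terms/knows", "http://example.org/people/b"),
]


def as_digraph(triples):
    """View a list of triples as a labelled digraph: node -> [(arc label, node)]."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
        graph[obj]  # touching the key makes sure sink nodes appear too
    return dict(graph)


digraph = as_digraph(triples)
```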

But that's fine, because the technology provides a number of things which are not available in any widely recognised, remotely standardised, or (freely available) tool supported form. So the grand visions aren't essential.

Artificial Intelligence? Pah.

Long ago Dad introduced me to a quote, which he believes to have been first uttered by a head of the AI group at the University of Edinburgh.

Artificial it may be, Intelligence it most certainly isn't. [Source unknown]

I have in the past found Google to be quite effective at locating sources of quotes for me. In this case it is of no assistance whatsoever. If anyone has any leads on this, I'd love to know. If not, you heard it here first.

Spread your wings and fly, little meme.

XML, RDF, Topic Maps, and naming things with URIs

Mark Baker posted a very small, very simple example of why you get more from RDF for cheap than you do from XML.

Both links come from the comments at Sam Ruby's weblog, an informative dialog between some knowledgeable people. It covers some ground, getting into the issue of merging when identity is defined by the triples a node participates in rather than the URI of the node, and the issue of using the URI of an addressable resource to indicate a non-addressable resource. Here Topic Maps, I believe, have RDF beat, by building merging rules of this type right into the standard and by clearly distinguishing between URI-as-address-of-resource and URI-as-indicator-of-topic.

Those two things together have brought something slightly better into focus for me. FOAF uses blank nodes to represent people, and identity is based on the email address of a person (or the SHA1 checksum of that email address). That means that simply asserting triples from multiple FOAF graphs into one graph will not result in a properly merged graph, as there will be multiple blank nodes referring to the same person.

In return, it also means that there is no need to come up with some reliable mechanism for referring to people using URIs; that's the bit I hadn't noticed properly before.

It seems to me that RDF could really do with a standard way of dealing with this merging process. The pattern is going to be repeated over and over, and asking every application developer to deal manually with this merging for every vocabulary in which it is necessary ain't good enough. But recognising the situations in which such merging is necessary is non-trivial, since RDF has no notion of a subject indicator. Could it be schema driven?
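Here is a minimal sketch of what a schema-driven merge might look like, assuming the schema can somehow tell us which predicates identify their subject (I pass them in as a plain set). The function name `smush`, the "_:" convention for blank nodes, and the sample data are all my own inventions for illustration; this is not how any particular RDF library does it.

```python
def smush(graphs, identifying):
    """Merge graphs of (s, p, o) string triples, unifying blank nodes that
    share a value for any predicate listed in `identifying`."""
    canonical = {}  # (predicate, value) -> merged blank node
    rename = {}     # (graph index, blank node) -> merged blank node

    # Pass 1: find blank nodes that carry an identifying property.
    for i, graph in enumerate(graphs):
        for s, p, o in graph:
            if p in identifying and s.startswith("_:"):
                key = (p, o)
                if key not in canonical:
                    canonical[key] = "_:m%d" % len(canonical)
                rename[(i, s)] = canonical[key]

    def resolve(i, term):
        if not term.startswith("_:"):
            return term  # URIs and literals pass through untouched
        # Unidentified blank nodes stay distinct per source graph.
        return rename.get((i, term), "_:g%d_%s" % (i, term[2:]))

    # Pass 2: rewrite every triple using the merged node names.
    merged = set()
    for i, graph in enumerate(graphs):
        for s, p, o in graph:
            merged.add((resolve(i, s), p, resolve(i, o)))
    return merged


g1 = [("_:a", "foaf:mbox_sha1sum", "abc123"), ("_:a", "foaf:name", "Alice")]
g2 = [("_:x", "foaf:mbox_sha1sum", "abc123"),
      ("_:x", "foaf:weblog", "http://example.org/alice/blog")]
merged = smush([g1, g2], identifying={"foaf:mbox_sha1sum"})
```

After the merge, the name and the weblog hang off the same node. The sketch is single-pass, so a node with two different identifying values will not pull two groups together; a proper version would need to iterate, and would need to tell literals from blank nodes by type rather than by prefix.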

Confusing a thing with a web page about it

I notice the following on Libby Miller's weblog, referring to the RSS+events module.

I'm really pleased to see that it's been updated so that the event is a 'thing' in itself and isn't confused with the webpage describing it … [Libby Miller]

Which I am noting here mostly so I can find it again if necessary, but also because it hints at something which has been bugging me about RDF, but which is (I think) more cleanly dealt with in Topic Maps.

If you use a URI to represent a non-addressable object like a person, then it had better not be the URI of an addressable object (like a web page) because if it is then you can't distinguish between statements about the person and statements about the web page. In Topic Maps you can talk about the resource at a URI and the thing indicated by a URI unambiguously, so for example I could make statements about my official web page and also make statements using the URI of that page as a subject indicator for me.
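A toy illustration of the distinction, and nothing like real Topic Maps syntax: if every subject is a pair of a URI and the role the URI plays, the same URI can safely be used both ways. The names `Role`, `Subject`, and `about`, the property names, and the page URI are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    RESOURCE = "resource"    # the page retrievable at the URI
    INDICATED = "indicated"  # the thing which the page at the URI is about


@dataclass(frozen=True)
class Subject:
    uri: str
    role: Role


PAGE = "http://example.org/staff/me.html"  # made-up URI

statements = [
    (Subject(PAGE, Role.RESOURCE), "lastModified", "2004-01-15"),
    (Subject(PAGE, Role.RESOURCE), "format", "text/html"),
    (Subject(PAGE, Role.INDICATED), "name", "A. Person"),
]


def about(statements, subject):
    """Collect the properties asserted of exactly this subject."""
    return {p: o for s, p, o in statements if s == subject}


page_facts = about(statements, Subject(PAGE, Role.RESOURCE))
person_facts = about(statements, Subject(PAGE, Role.INDICATED))
```

With a bare URI as the subject, those three statements would all land on one node, and the person would acquire a MIME type.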

Ah. In her previous post Libby comments more fully:

My experiments with RDFical and RSS 1.0 use foaf:topic to separate the RSS 1.0 feed item (with its url) from the event itself (which might have a homepage, but is not itself a url). This issue is analogous to people and urls. People are not webpages though they may have homepages. Events are not webpages, though they may have homepages and other pages about them.

Chris was arguing that the rss link did not have to be a url, but could be a non-url uri, and therefore not confusable with a webpage. Ok, that seems more reasonable, although I worry that people will in fact tend to use actual urls of webpages, especially because that's what RSS is designed for. [Libby Miller]

I think that when it comes to naming models the worry expressed here isn't an issue. There is no existing tradition of having a well defined web page per model (not distinct from modelling tool) or model implementation, so a format which declares that models shall be given URIs which are not URLs of web-retrievable objects should be fine.

Projecting trees in the UI

Raw: OPML Considered H...awful

Working in an outliner-like domain with IdeaGraph, personally I was thinking of using a not-broken alternative such as OML for representing arbitrary trees. But after a year or so of expecting to need it any day, that day hasn't come. Purpose-built XML languages or RDF vocabularies are much more useful. (X)HTML is far better suited to representing structured documents than OPML. RDF is generally considerably better at representing 'outline' structures relating to resources such as documents or URI (lists). Basically a tree-based system falls down when your data isn't structured as a tree, which is most of the time on the web. I've nothing against trees - they're relatively simple to implement, efficient and intuitive to use.

For example, if you're using a tree-based outline, and a channel appears in more than one category, you somehow either have to hack a connection across (can you imagine XLink in OPML?) or duplicate the item. Either way you lose the benefits of using a tree in the first place. But if you use a model based on a general node and arc graph structure (such as RDF) then you can still project trees in the user interface. It's just a separation of model and presentation. [Danny Ayers]

A good article, with sensible sounding arguments. Certainly it's too early to voluntarily accept legacy restrictions on formats used (let's face it; no one, in real terms, uses this stuff yet).

Of more general interest to me was the comment that you can still project trees in the user interface. This is precisely what Corel InfoCentral did before it was given a lobotomy. The model was a graph, the interface presented a classic +/- collapsible tree view. If you expand a node, then its parent appears as a child (all connected nodes are included as children). Expand that child, and your current node appears again. You could do this all day, but you wouldn't, so it works fine.
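As best I can reconstruct it, the behaviour amounts to something like the following sketch. The function names and the three-node example are mine, and I have no idea how InfoCentral actually implemented it. The model is an undirected graph; the view expands a node by listing every connected node as a child, to whatever depth the user has clicked.

```python
from collections import defaultdict


def adjacency(edges):
    """Build an undirected adjacency map from a list of (node, node) pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def project(adj, root, depth, indent=0):
    """Render the graph as an indented tree rooted at `root`, expanded to
    `depth` levels. Every connected node appears as a child, parent included."""
    lines = ["  " * indent + root]
    if depth > 0:
        for neighbour in sorted(adj[root]):
            lines.extend(project(adj, neighbour, depth - 1, indent + 1))
    return lines


# A triangle: definitely not a tree, but it projects into one happily.
adj = adjacency([("a", "b"), ("b", "c"), ("a", "c")])
view = project(adj, "a", depth=2)
```

The cycle never causes trouble because the depth is bounded by how far the user bothers to expand, which is the "you could do this all day, but you wouldn't" part.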

Let me repeat that, because I think it's important:

But if you use a model based on a general node and arc graph structure (such as RDF) then you can still project trees in the user interface.

Aside: Oh, I've just discovered that the TypePad quickpost window lets me trackback ping the weblog entry I'm quickposting from by selecting it from a list. Cool. Very cool.
