A dummy’s introduction to linked data (me being the dummy)

This is an edited version of a talk I gave on a panel about linked data and the semantic web at News: Rewired, on Thursday 16th December. The presentation slides can be seen here on slideshare.

Disclaimer: I’m not a technologist. I’m not a programmer. If you’re a geek then this piece isn’t meant for you. It’s for those of us trying to get to grips with the potential of technology and the web for news, politics, business and society, but without too much technical know-how.

What is linked data?

In the 18th century Voltaire wrote that the Holy Roman Empire was neither Holy, nor Roman, nor an Empire. You could say something similar about ‘linked data’. Linked data is neither ‘linked’ – in the way we think of hyper-linking on the web; nor is it ‘data’ – in the sense of numbers or databases. So what is it?

Data as ‘things’

The data part of linked data is really discrete ‘things’. Identifiable things like people, places, organisations, events. You are a discrete thing. I am a discrete thing. In real life there is one you, one me, one capital city called London. On the web there are likely to be many you’s – your Facebook profile, your LinkedIn profile, your flickr pages, pictures of you on other people’s pages, your blog, other blogs about you – you get the picture.

Trouble is, if you’re spread in different places over the web, how does the web know it’s you? I’m certainly not the only Martin Moore. There is Daniel Martin Moore, the singer songwriter from Kentucky (who has a new album out). There is Martin Moore the under-20 Ireland front-row rugby player (whose career I’m now following with interest). There is Martin Moore the cellar master from South Africa (I’m jealous of his job). There is Martin Moore QC. There is Martin Moore kitchens.

I’m not any of these Martin Moores. I am me. I like to think I am, in the words of Chuck Palahniuk in Fight Club – like all of us – ‘a sacred, unique snowflake’.

But how does the web know this? How does the web (and therefore people searching for me, or trying to recommend things to me) know who I am?

Well, if stuff about me is put on the web as linked data then I am given a unique identifier. A sort of human ISBN. A web snowflake. So that, whenever I publish something, or someone publishes something about me, then the web knows it’s me and not one of the many other Martin Moores out there.
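For the curious, here is a rough sketch of the idea in Python. It is only an illustration: the example.org addresses are invented for this piece, and real linked data identifiers would be web addresses chosen by whoever publishes the data.

```python
# Hypothetical identifiers: these example.org addresses are made up.
# The point is that three people share one name but never share an identifier.
PEOPLE = {
    "http://example.org/person/1": {"name": "Martin Moore", "note": "author of this talk"},
    "http://example.org/person/2": {"name": "Martin Moore", "note": "rugby player"},
    "http://example.org/person/3": {"name": "Martin Moore", "note": "cellar master"},
}


def find_by_name(name):
    """Return every identifier whose record carries the given name."""
    return sorted(uri for uri, record in PEOPLE.items() if record["name"] == name)


def describe(uri):
    """Look up one person by identifier; this is never ambiguous."""
    return PEOPLE[uri]["note"]


matches = find_by_name("Martin Moore")
print(len(matches))                              # searching by name finds several people
print(describe("http://example.org/person/1"))   # searching by identifier finds exactly one
```

Searching by name returns all three Martin Moores; looking up an identifier returns just one of us.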

Linking as grammar

Now we move on to the ‘linked’ bit of linked data. A hyperlink is a dumb link – in the sense that it just says ‘click on me and I’ll take you to another web page’. It doesn’t know why these pages are linked together, or what the relationship between them is; you have to work that out from the context.

If you publish it as linked data then you explain the relationship between the two things you’re linking. This person wrote this article. This organisation launched this product. This event happened at this time. It’s a bit like grammar, where you have subject – verb – object, i.e. John kissed Mary. (In linked data language this is called a ‘triple’ though being non-techie I prefer to think of it in grammatical terms.)
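If it helps to see the subject – verb – object idea written down, here is a minimal sketch in Python. The names other than John and Mary (Jane Doe, Acme Ltd and so on) are hypothetical, and plain words stand in for the web addresses that real linked data would use for each part.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One statement: a subject, a verb (the relationship) and an object."""
    subject: str
    verb: str
    obj: str


# Illustrative statements only; everything except "John kissed Mary" is invented.
TRIPLES = [
    Triple("John", "kissed", "Mary"),
    Triple("Jane Doe", "wrote", "Article 42"),
    Triple("Acme Ltd", "launched", "Widget X"),
]


def who(verb, obj):
    """Answer 'who did this to that?' by matching verb and object."""
    return [t.subject for t in TRIPLES if t.verb == verb and t.obj == obj]


def links_from(subject):
    """List every relationship that starts from the given subject."""
    return [(t.verb, t.obj) for t in TRIPLES if t.subject == subject]


print(who("kissed", "Mary"))
print(links_from("Acme Ltd"))
```

Unlike a plain hyperlink, each entry says what the relationship is, so a program can answer "who kissed Mary?" without reading the surrounding text.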

Suddenly, instead of having an indistinguishable soup of stuff on the web, you have lots and lots of distinct entities with clearly defined relationships.

Good reasons for publishing in linked data

So what? I hear you say. Why should I care about this in my day job? Well, there are a bunch of reasons why this could be a big deal. Here are just a few:

Publish in linked data and you can make your site much richer – both in terms of links and, potentially, in terms of automatically generated content. The BBC’s natural history pages are filled with interesting stuff about animals – including video clips, information about distribution, habitats, behaviours (e.g. see this one on the lion – complete with great sound clip of a lion growling and snarling). But only some of this content is produced by the BBC (mostly the video). Lots of the other information is automatically sourced from elsewhere – sites like WWF and Wikipedia. By combining it all together the BBC has pages that are far deeper and more threaded into the web.

This can have a great knock-on effect on where your page/site comes in search engine results. The BBC’s natural history pages, for example, which used to come somewhere way down the rankings, now appear in the top 10 results on Google (when I typed ‘lion’ into google.co.uk earlier this week, the BBC page came fourth, while the page for ‘aardvark’ came third).

Linked data can also help with sourcing. Now that lots of primary data sources are being published as linked data (e.g. on data.gov.uk) you can link directly back to the raw figures that you’re writing about. So if you write a piece about the rise in car thefts in south Wales, people can follow a line straight from your piece to the Home Office data on which it was based.

It can improve accreditation. When a piece carries clear, consistent and unambiguous information about who wrote it, who published it, when it was published, where it was written and so on, the producer gets better credit, and the person reading has the tools to judge its credibility.

It can make searching really smart. Let’s say you wanted to search for all the composers who worked in Vienna between 1800 and 1875. Right now that’s pretty tricky – or at least it might take a bit of digging to work out. But if the information was published in linked data format you could just ask for all composers who worked in Vienna between 1800 and 1875 and get a list back, because the web itself becomes a sort of distributed database.
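As a toy illustration of that kind of question, here is a short Python sketch. The people and dates are placeholders rather than real biography, and a real linked data query would run against published data using a dedicated query language instead of a hand-written list.

```python
# Placeholder data: "composer-a" and friends are invented, as are the years.
TYPES = {
    ("composer-a", "Composer"),
    ("composer-b", "Composer"),
    ("painter-c", "Painter"),
    ("composer-d", "Composer"),
}

# (who, city, first year there, last year there)
WORKED_IN = [
    ("composer-a", "Vienna", 1795, 1827),
    ("composer-b", "Paris", 1810, 1849),
    ("painter-c", "Vienna", 1820, 1860),
    ("composer-d", "Vienna", 1880, 1900),
]


def composers_in(city, start, end):
    """Find composers whose time in a city overlaps the years start..end."""
    composers = {who for who, kind in TYPES if kind == "Composer"}
    return sorted({
        who
        for who, place, first, last in WORKED_IN
        if who in composers and place == city and first <= end and last >= start
    })


print(composers_in("Vienna", 1800, 1875))
```

The question combines three separate facts (what someone is, where they worked, and when), which is exactly what is hard to do with an ordinary keyword search.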

Finally, but perhaps most importantly, linked data can create an environment that enables innovation and the creation of new services. Suddenly it becomes possible to build really smart stuff based on the way in which things are linked together. The BBC’s World Cup site did just this in the summer of 2010, publishing huge amounts of information – more than any team of journalists could put together themselves – sourced from lots of different places. The New York Times now publishes in linked data and encourages people to build new stuff to leverage it. There is a tutorial for building a web app to show NYT coverage of a school’s alumni, for example – see a finished app here.

Other companies are starting to use linked data and other semantic information to build recommendation engines (like GetGlue). People can start adding value to your data in ways that you would never have thought of.

A final warning

The basic premise of linked data is wonderfully simple. You link discrete things together in such a way that we know the relationship between them (subject-verb-object). Once linked, the web then starts to have an artificial intelligence of its own.

But putting this basic premise into action is more complicated. Publishing in linked data for the first time is not for the faint-hearted (we now publish journalisted.com in linked data, so we learnt for ourselves how complex it can be). You can find yourself quite quickly mired in the intricacies of linked data formats, vocabularies and many acronyms.

There are, though, ways to move towards linked data without plunging in head first. Just publishing structured metadata is a very good start (for which there are various plugins for open source CMSs like WordPress). Microformats are also a much easier entry point for those wanting to introduce some metadata to what they publish (e.g. hNews for news).

Linked data is remarkable. It’s also a little scary. But the sooner people understand its potential and start making their information more ‘semantic’, the healthier and more navigable the web will be.