So What Is It About Linked Data that Makes it Linked Data™?

If you’ve been to any conferences lately where Linked Data has been on the agenda, you’ll probably have seen the four principles of Linked Data (I grabbed the following from Wikipedia…)

1. Use URIs to identify things.
2. Use HTTP URIs so that these things can be referred to and looked up (“dereference”) by people and user agents.
3. Provide useful information (i.e., a structured description — metadata) about the thing when its URI is dereferenced.
4. Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.

Wot, no RDF? ;-) (For the original statement of the four rules, see Tim Berners-Lee’s Design Issues: Linked Data personal note, which does mention RDF.)

Anyway – here’s my take on what we have… building on my Parliamentary Committees Treemap, I thought I’d do something similar for the US 111th Congress committees to produce something like this map for the House:

US 111th Congress committees

I reused an algorithm I’d used to produce the UK Parliamentary committee maps:

– grab the list of committees;
– for each committee, grab the membership list for that committee.

That is, I want to annotate one dataset with richer information from another one; I want to link different bits of data together…

The “endpoint” I used to make the queries for the Congress committee map was the New York Times Congress API.

The quickest way (for me) to get the data was to use a couple of Yahoo Pipes. Firstly, here’s one that will get a list of committee members from a 111th Congress House committee given its committee code (it’s left as an exercise for the reader to generalise this pipe to also accept chamber and congress number arguments ;-)

I get the data using a URL. Here’s what one looks like:
http://api.nytimes.com/svc/politics/v3/us/legislative/congress/111/house/committees/HSAG.xml?api-key=MY_KEY

So given a committee code, I can get a list of members. Here’s what a single member’s record looks like:

rank_in_party: 5
name: Neil Abercrombie
begin_date: 2009-01-07
id: A000014
party: D
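
By way of illustration, here’s a minimal Python sketch of dereferencing one of these committee URLs and pulling out the member records (the <member> element name is a guess on my part, not checked against the actual NYT response):

import urllib.request
import xml.etree.ElementTree as ET

API_KEY = "MY_KEY"  # placeholder, as in the URL above

def committee_members(committee_id, congress=111, chamber="house"):
    # Dereference the committee URI; the URL pattern is the one shown above
    url = (f"http://api.nytimes.com/svc/politics/v3/us/legislative/congress/"
           f"{congress}/{chamber}/committees/{committee_id}.xml"
           f"?api-key={API_KEY}")
    tree = ET.parse(urllib.request.urlopen(url))
    # Assumed structure: one <member> element per committee member, wrapping
    # fields like those in the record shown above
    return [{field.tag: field.text for field in member}
            for member in tree.iter("member")]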

If I wanted to annotate these details further, there is also a list of House members that returns records of the form:

id: A000014
api_uri: http://api.nytimes.com/svc/politics/v3/us/legislative/congress/members/A000014.json
first_name: Neil
middle_name: null
last_name: Abercrombie
party: D
seniority: 22
state: HI
district: 1
missed_votes_pct: 12.81
votes_with_party_pct: 98.27

I can grab a single member record using a URL of the form:
http://api.nytimes.com/svc/politics/{version}/us/legislative/congress/members/{member-id}[.response-format]?api-key=MY_KEY
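
Filling in that pattern is mechanical; a hypothetical helper (the defaults are my choices, not mandated by the API docs) might look like:

def member_uri(member_id, version="v3", response_format="json",
               api_key="MY_KEY"):
    # Instantiate the NYT member URL pattern quoted above
    return (f"http://api.nytimes.com/svc/politics/{version}"
            f"/us/legislative/congress/members/{member_id}"
            f".{response_format}?api-key={api_key}")

# member_uri("A000014") reproduces (api-key aside) the api_uri field above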

Now, where can I get a list of committees?

From a URL like this one:
http://api.nytimes.com/svc/politics/v3/us/legislative/congress/111/house/committees.xml?api-key=MY_KEY

The data returned has the form:

chair: P000258
url: http://agriculture.house.gov/
name: Committee on Agriculture
id: HSAG

Here’s how I grab the committee listing and then augment each committee with its members (there’s a code sketch of the equivalent query after the URL pattern below).

Although I don’t directly have an identifier in the form of a URL for the membership list of a committee, I know how to generate one given a URL pattern for committee resources and a committee ID. The pattern generalises around the chamber (House or Senate) and Congress number as well:
http://api.nytimes.com/svc/politics/{version}/us/legislative/congress/{congress-number}/{chamber}/committees[/committee-id][.response-format]?api-key=MY_KEY
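
Pipes screenshots aside, here’s roughly what that linking query looks like as a Python sketch, reusing committee_members() from the earlier snippet (again, the XML element names are assumptions):

import urllib.request
import xml.etree.ElementTree as ET

def committees(congress=111, chamber="house", api_key="MY_KEY"):
    # Grab the list of committee IDs for a chamber of a given Congress
    url = (f"http://api.nytimes.com/svc/politics/v3/us/legislative/congress/"
           f"{congress}/{chamber}/committees.xml?api-key={api_key}")
    tree = ET.parse(urllib.request.urlopen(url))
    return [c.findtext("id") for c in tree.iter("committee")]

def annotated_committees(congress=111, chamber="house"):
    # The "linking query": the URI pattern plus local context (congress,
    # chamber, committee ID) generates each dereferenceable membership URI
    return {cid: committee_members(cid, congress, chamber)
            for cid in committees(congress, chamber)}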

So I think this counts as linkable data, and we might even call it linked data. If I work within a closed system, like the pipes environment, then using “local” identifiers, such as committee ID, chamber and congress number, I can generate a URL-style identifier that works as a web address.

But can we call the above approach a Linked Data™ approach?

1. Use URIs to identify things.
This works for the committee membership lists, the list of committees and individual members, if required.

2. Use HTTP URIs so that these things can be referred to and looked up (“dereference”) by people and user agents.
Almost – at the moment the views are XML or JSON (no human-readable HTML), but at least in the committee list there’s a link to a web page aimed at a human audience.

3. Provide useful information (i.e., a structured description — metadata) about the thing when its URI is dereferenced.
The members’ records are useful, and the committee records do describe the name of the committee, along with its identifier. But the info that makes a committee record uniquely identifiable exists “above” the individual committee record (e.g. the congress number and the chamber). In a closed pipes environment, such as the one described above, if we can propagate the context (committee ID, chamber, congress number), we can uniquely identify resources using dereferenceable HTTP URIs (i.e. things that work as web addresses) via a URI pattern and local context.

4. Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.
Yes, we have some of that…

So, the starter for ten: do we have an example of Linked Data™ here? Note there is no RDF and no SPARQL endpoint exposed to me as a user. But I’ve had to use connective tissue to annotate one HTTP URI identified resource (the committee list) with results from a family of other HTTP URI identified resources (the membership lists). I could have gone further and annotated each member record with data from the “member’s info” family of HTTP URIs.

The “top level” pipe is a “linking query”. If I had constructed it slightly differently, I could have passed in a chamber and congress number and it would have:
– constructed an HTTP URI to look up a list of committees for that chamber in that Congress (this was a given in the pipe shown above);
– grabbed the list of committees;
– annotated them with membership lists.

As it is, the pipe contains “assumed” context (the congress number and chamber), as well as the elephant in the room assumption – that I’m making queries on the NYT Congress API.

On reflection, this is perhaps bad practice. The congress number and chamber are hidden assumptions within the pipe. The URL pattern that the NYT Congress API defines explicitly identifies mutable elements/parameters:

http://api.nytimes.com/svc/politics/{version}/us/legislative/congress/{congress-number}/{chamber}/committees[/committee-id][.response-format]?api-key={your-API-key}

This suggests that maybe best practice would be to pass local context data via user parameters throughout the pipework, to guarantee a shared local context within child pipes?
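
In the sketches above, that just amounts to threading the context through as explicit arguments rather than baking it in, e.g.:

# Same linking query, different local context; nothing is hidden in the
# "pipe" beyond the choice of the NYT endpoint itself
senate_committees = annotated_committees(congress=111, chamber="senate")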

So where am I coming from with all this?

I’m happy to admit that I can see how it’s really handy having universal, unique URIs that resolve to web pages or other web content. But I also think that local identifiers can fulfil the same role if you can guarantee the context as in a Yahoo Pipe or a spreadsheet (e.g. Using Data From Linked Data Datastores the Easy Way (i.e. in a spreadsheet, via a formula)).

So for example, in the OU we have course codes which can play a very powerful role in linking resources together (e.g. OU Course Codes – A Web 2.OU Crown Jewel). I’ve tended to use the phrase “pivot point” to describe the sorts of linking I do around tags, or course codes, or the committee identifiers described in this post and then show how we can use these local or partial identifiers to access resources on other websites that use similar pivot points (or “keys”). (ISBNs are a great one for this, as ISBN Playground shows.)

If Linked Data™ zealots continue to talk about Linked Data solely in terms of RDF and SPARQL, I think they will put off a lot of folk who are really excited about the idea of trying to build services across distributed (linkable) datasets… IMVHO, of course…

My name’s Tony Hirst, I like linking things together, but RDF and SPARQL just don’t cut it for me…

PS this is relevant too: Does ‘Linked Data’ need human readable URIs?

PPS Have you taken my poll yet? Getting Started with data.gov.uk… or not…

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

10 thoughts on “So What Is It About Linked Data that Makes it Linked Data™?”

  1. The HTTP based Linked Data acid test is quite simple. Publish the Generic HTTP URI (its Identifier) of your data object (or resource), and then see if de-referencing the URI results in a structured description of said data object. Basically, this means that the following should be clearly discernible in a structured hypermedia based data representation:

    Entity — Identifier for Subject of Description
    Attributes — one or more discernible characteristics of Subject Entity (left side of an Attribute=Value pair)
    Value — right side of the Attribute=Value pair referred to above.

    As for whether this is RDF or not, the question is inherently two-fold:

    1. Is your description represented in an Entity-Attribute-Value graph? (This is what the RDF data model is a variant of, i.e., it just adds URIs for Entity Identifiers.)

    2. What is the data representation of your Entity-Attribute-Value based description (i.e., what is de-referenced via your data source HTTP URI)?

    Kingsley

    1. Hmmm…. so:

      – where does this requirement appear in the 4 principles?
      – is what I described above Linked Data™ or not?

      Does a JSON object.attribute(value) representation qualify (or some other take on the JSON object/attribute/value view)?

      How about a spreadsheet, where column 1 refers by convention to identifiers (objects), row 1/column names refer by convention to attributes, and the other cells are values?
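
      For what it’s worth, here’s a hypothetical sketch (mine, nothing standard) of reading such a sheet as Entity-Attribute-Value triples:

      import csv

      def sheet_to_eav(path):
          # Column 1 = entity identifier, header row = attribute names,
          # remaining cells = values, per the convention described above
          with open(path, newline="") as f:
              rows = csv.reader(f)
              header = next(rows)
              return [(row[0], attr, value)
                      for row in rows
                      for attr, value in zip(header[1:], row[1:])]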

      IMHO, forcing the SPARQL and RDF view may be useful to the priesthood who appreciate the purer view of things (though I suspect the philosophers might get a bit angsty with the Platonic idealism of it all), but the “deep concepts” that are evangelised as necessary truths are largely meaningless or irrelevant to the pragmatists?

  2. Thanks Tony. Very interesting post. I’ve been reading a lot recently about what does and doesn’t qualify as Linked Data / linked data and this definitely adds to the debate.

    There’s a lot of sense in your approach but I wonder if you could possibly say a little more about why you think RDF and SPARQL don’t cut it? Is it just that you don’t think they are necessary to facilitate data linking?

    Thanks.

    1. I will do at least another post along these lines, but I’d also like to ask something of you (and anyone else who wants to try to answer the question; I’m not trying to catch anyone out here, I’m genuinely confused about this area and I don’t have a clear understanding of what I think Linked Data is, or what I think other people think Linked Data is).

      Based on your reading, and the understanding you have come to based on it, does the above represent an example of Linked Data in action? As the old-style questions put it: explain your answer. If it does not, what changes would you need to make to the representation(s) used in order for it to be cast as a demonstration of Linked Data usage?

    2. Hi Tony,

      Apologies for the delay in replying. The short and honest answer to your question “does the above represent an example of Linked Data in action?” is: it depends. And as you so rightly point out, what it depends on is whether or not the use of RDF and SPARQL is mandatory for data to become Linked Data. Which in turn depends on whether you regard the “four rules” of TBL’s Linked Data Design Issues as normative and mandatory, and tbh I don’t think you can. In fact TBL states in Design Issues:
      “I’ll refer to the steps above as rules, but they are expectations of behavior. Breaking them does not destroy anything, but misses an opportunity to make data interconnected.”

      So, you may have broken a few rules but you haven’t destroyed anything. And more importantly you have demonstrated Linked Data in action: you have made the data interconnected. So unless the W3C produces a normative standard which states Linked Data must use RDF and SPARQL, then I would say that what you have here is linked data / Linked Data / linkable data, or whatever you choose to call it.

      Or am I just being naive?

  3. Here are my views re. commonly known RDF Data Representation Formats and SPARQL re. Linked Data: They are both implementation details.

    The problem with RDF (Resource Description Framework) is that for historic and somewhat political reasons, it’s perceived as being analogous to the RDF/XML data representation format.

    RDF is a framework based on an Entity-Attribute-Value graph model. You can represent (markup) data for this model in a myriad of ways.

    RDF simply mandates that the Entity, Attribute, and Values (optionally) are identified using URIs.

    SPARQL enables you to query RDF model data hosted in an RDF-Graph model oriented database (Quad or Triple store) that supports the query language.

    Linked Data, or to be precise: HTTP based Linked Data, arises when you use Generic HTTP URIs (#ghuri-s) as the URI type for Entities, Attributes, and Values (optionally) in the E-A-V model graph, used to describe an Entity (aka Datum, Data Item, Data Object, or Non Information Resource).

    Q: What’s the big deal re. Generic HTTP URIs?

    A: They leverage an inherent piece of ingenuity that’s intrinsic to the HTTP protocol, i.e., they provide a hybrid mechanism for Naming (Identifying, Referring To, or Referencing) a Data Object and Locating (Accessing) its Description, in a variety of data representation formats.

    Q: Where does SPARQL fit into the above?

    A: One approach to delivering a structured description for a data object is to query its host database using the SPARQL Protocol (yes, it’s a protocol, language, and result serialization combo).

    Q: What about other data representation formats that are not part of the W3C specs or recommendations etc. Can they be classed as Linked Data bearers?

    A: Of course; the minimal requirement isn’t SPARQL or RDF. It’s the generation of resources that bear hypermedia content whose structure is based on an EAV model where Generic HTTP URIs are used to Identify: Entity, Attribute, and Values (optionally). Example: the recently released OData initiative from Microsoft, which is simply an alternative data representation format for an E-A-V model graph.

    Bar any typos, I hope this clarifies matters :-)

    Links:

    1. http://bit.ly/b6YdDc — Linked Data Rules Simplified
    2. http://bit.ly/d2NzLM — What is the Linked Data meme about?
    3. http://bit.ly/eBLv1 — The URI, URL, and Linked Data Meme’s Generic HTTP URI

  4. Hi Tony.

    Yeah for sure, what ‘Linked Data’ is, and what ‘linked data’ is? All a bit messy. I had a quick read above (glossing over the pipes details in truth, as I’m not that savvy). Hope I’m not missing too many ironic subtleties, but to me I think it makes sense that ‘Linked Data™’ requires the RDF as per http://www.w3.org/DesignIssues/LinkedData.html, and your Yahoo pipes stuff is definitely linked data FWIW, though I suspect for some it’s also Linked Data™. So all a bit messy. I hope I’m not stating the bleedin’ obvious there :)

    Isn’t the deal really, that if you buy the Linked Data idea, it’s about a consistent approach to wrapping data upon which a consistent approach to getting data out there can be built, with all that entails, e.g. tools, workflows etc.? Isn’t the problem (arguably :) with your approach above that it’s an ad hoc, case-by-case approach which requires clever dudes like yourself to knock up, but upon which it is difficult to build systems (human and computer) across institutions? As Jeni Tennison might say, you can’t get patterns established based on this way of doing things, so it doesn’t scale.

    I suspect you’re grumbling ‘even cleverer dudes (which surely don’t exist) are needed for this RDF/SPARQL dream’, and that getting such an ambitious idea off the ground, when it seems so hard to do, is never gonna happen. If so, yeah, I can see that. I can, rather boringly, sitting-on-the-fence-ways, see both approaches have their flaws.

    Apols if I’m off the mark on what you’re trying to get at here. I’m looking forward to the ‘RDF and SPARQL don’t cut it’ post.

    Ade
