Wouldn’t it be handy if we could treat all the public spreadsheets uploaded to Google docs as queryable tables in a database? Well, it appears that you can do so, at least at an individual spreadsheet level: Introducing the Google Visualization API.
Over the weekend, I started exploring the Google Visualisation API Query Language, which is reminiscent of SQL (if that means anything to you!). This language provides a way of interrogating a data source such as a public online Google spreadsheet and pulling back the results of the query as JSON, CSV, or an HTML table.
Got that? I’ll say it again: the Google Visualisation API Query Language lets you use a Google spreadsheet like a database (in certain respects, at least).
Google query languages are defined on a spreadsheet in the following way:
http://spreadsheets.google.com/tq?tq=QUERY&key=SPREADSHEET_ID
Although defined, by default, to return JSON data from a query, wrapped in a pre-defined (and fixed?) callback function (google.visualization.Query.setResponse()), it is also possible to display the results of a query as an HTML table (which is “useful”, as the documentation says, “for debugging”). The trick here is to add another argument to the URL: tqx=out:html, so for example a query would now be defined along the lines of:
http://spreadsheets.google.com/tq?tqx=out:html&tq=QUERY&key=SPREADSHEET_ID
Using the Guardian datastore’s MPs expenses spreadsheet 2007-8 as an example, we can write quite a wide variety of queries, which I’ll show below in their ‘HTML preview’ form.
(In a ‘real’ situation, you are more likely to retrieve the data as JSON and then process it as an object. Or, as I will also demonstrate, take the results of the query as CSV output (tqx=out:csv rather then tqx=out:html) and pull it directly into a service such as Many Eyes WIkified.)
The generic URL is of the form: http://spreadsheets.google.com/tq?tqx=out:html&tq=QUERY&key=phNtm3LmDZEObQ2itmSqHIA.
In the examples, I will just show the unencoded select statement, but the link will be the complete, well-formed link.
So here we go:
- show everything – fetch the whole table: select * (in a lot of computer languages, ‘*’ often refers to ‘everything and anything’);
- just show some particular columns, but again for everyone: fetch just columns B (surname), C (first name) and I (total additional costs allowance): select B,C,I
- only show the names of people who have claimed the maximum additional costs allowance (£23,083): fetch just columns B, C and I where the value in column I is 23083: select B,C,I where I=23083 (column I is the additional costs allowance column);
- How many people did claim the maximum additional costs allowance? Select the people who claimed the maximum amount (23083) and count them: select count(I) where I=23083
- So which people did not claim the maximum additional costs allowance? Display the people who did not claim total additional allowances of 23083: select B,C,I where I!=23083 (using <> for ‘not equals’ also works); NB here’s a more refined take on that query: select B,C,I where (I!=23083 and I>=0) order by I
- search for the name, party (column D) and constituency (column E) of people whose first name is Jane or is recorded as John (rather than “Mr John”, or “Rt Hon John”): select B,C,D,E where (C contains ‘Joan’ or C matches ‘John’)
- only show the people who have claimed less than £100,000 in total allowances : select * where F<100000
- what is the total amount of expenses claimed? Fetch the summed total of entries in column I (i.e. the total expenses claimed by everyone): select sum(I)
- So how many MPs are there? Count the number of rows in an arbitrary column: select count(I)
- Find the average amount claimed by the MPs: select sum(I)/count(I)
- Find out how much has been claimed by each party (column D): select D,sum(I) where I>=0 group by D (Setting I>0 just ensures there is something in the column)
- For each party, find out how much (on average) each party member claims: select D,sum(I)/count(I) where I=0 group by D
To create your own queries, just hack around the URIs.
One other trick is to grab a CSV output, rather than an HTML output, and pull it into Many Eyes Wikified, and then visualise it within that environment – so we grab the data (in this case, using select D,sum(I) where I>=0 group byD, i.e. the total amount of additional costs allowance claims by party):
to give this:
and then visualise it in an appropriate way:
So to recap this final case, then, we are running a query on the original spreadsheet that calculates the total additional costs allowance claims per party, and emits the results as CSV. These results are imported into Many Eyes Wikified, and displayed therein.
Now I’m pretty sure that Many Eyes Wikified will continue (how often?) to synch data from a potentially changing data source, which means we should be able to use a similar approach to plot a running total of claims from the Shadow Cabinet Expenses spreadsheet…
…but, at the time of writing at least, it seems as if the publication/privacy settings on that spreadsheet are set such that access via th query language is denied…:-(
Anyway – that was a quick intro to the Google Visualisation API Query Language – so go play… ;-)
PS so what other spreadsheets might make for some interesting queries?
PPS @adrianshort has made a valuable point about how easy it is for a publisher to change the order of rows in a spreadsheet, and hence make a nonsense of your query. (Also, I think the approach I’m taking sort of assumes a simple, regular spreadsheet where row 1 is for headers, then the data, and ideally no other text e.g. in cells below the table describing the data in the table.) So always check… ;-)
PPPS If the first row in the table defines column headings, then there are intervening lines (maybe spaces) before the data starts, putting offset N (where N is a number) will skip that many rows before displaying the data.
Something else I noticed on the order by setting, this can be of the form order by COL asc (to sort in ascending order, which is the default) or order by COL desc ( to sort in descending order).





It can be even more interesting. There’s an open source javascript library called Simile Exhibit with with you can do many things, including timelines and maps: http://www.simile-widgets.org/
Take a look at: http://www.rtvutrecht.nl/dossiers/205847
Everything is dragged in from a Google Spreadsheet!
@Marko – i m a big fan of simile exhibit, but the query language option is even better then simile exhibit. i have a website also using simile exhibit off a google spreadsheet and given the size of the spreadsheet, the load time my website is very extensive which i have observed makes me lose visitors interest. GQL is better since it actively queries the spreadsheet for only relevant data and hence the load times are much more snappier.
Thanks for a great write-up, Tony. I can see this being useful for many things.
A word of caution though if you’re using regular queries against a spreadsheet as part of something that matters. Spreadsheet layouts are more likely to change than database schema so you need to keep an eye on your sources and output to ensure that you’re getting the results you expect. A spreadsheet owner simply transposing two columns could cause all kinds of grief.
Tony
Great stuff – I’m sure I’ll use this in teaching next year. Just noticed a typo: in your link to get CSV, I think the URL should contain tqx=out:csv
Chris
Excellent writeup – I am planning to change my simile exhibit powered website to use this instead.
Btw, in relation to your question about the json callback (“Although defined, by default, to return JSON data from a query, wrapped in a pre-defined (and fixed?) callback function…”) the request format does support a custom callback function using the tqx paramter as follows:
tqx=responseHandler:foo will call “foo” function as follows:
foo({status:’ok’, … });
Many thanks for this. Saved me a bunch of time (just starting using it in a small way on theyworkforyoulocal.com — an early-stage project to make local govt data accessible).
Agree with your comment on later post about data integrity, particularly consistency in naming/unique refs for the bodies concerned, although sometimes this is a problem of the original govt produced data
This is interesting indeed.
I tried with my own spreadsheet,
“http://spreadsheets.google.com/tq?tqx=out%3Ahtml&tq=select+count%28A%29+where+A%3DPHP%3Fkey%3D0AipNn919hx-OdHNhTDl1RExJVlFMUXVwOGUtNjdzanc&hl=en”
but I always got this message:
“Oops, an error occured.
Status: error
Reason: Access denied
Description: Access denied”
Anyone can view and edit the spreadsheet via the key I used in this example, how come I still get access denied ?
@James – there are a couple of permissions fields associated with Google spreadsheets – I don’t remember offhand which you need to set for the query language API to work, but you should maybe check them both?
One is in the “Publish as a web page” Share menu option, the other is in the “Get the link to share” option.