Data Scraping Wikipedia with Google Spreadsheets

Prompted in part by a presentation I have to give tomorrow as an OU eLearning community session (I hope some folks turn up – the 90 minute session on Mashing Up the PLE – RSS edition is the only reason I’m going in…), and in part by Scott Leslie’s compelling programme for a similar duration Mashing Up your own PLE session (scene scetting here: Hunting the Wily “PLE”), I started having a tinker with using Google spreadsheets as for data table screenscraping.

So here’s a quick summary of (part of) what I found I could do.

The Google spreadsheet function =importHTM(“”,”table”,N) will scrape a table from an HMTL web page into a Google spreadsheet. The URL of the target web page, and the target table element both need to be in double quotes. The number N identifies the N’th table in the page (counting starts at 0) as the target table for data scraping.

So for example, have a look at the following Wikipedia page – List of largest United Kingdom settlements by population (found using a search on Wikipedia for uk city population):

Grab the URL, fire up a new Google spreadsheet, and satrt to enter the formula “=importHTML” into one of the cells:

Autocompletion works a treat, so finish off the expression:

=ImportHtml(“http://en.wikipedia.org/wiki/List_of_largest_United_Kingdom_settlements_by_population”,”table”,1)

And as if by magic, a data table appears:

All well and good – if you want to create a chart or two, why not try the Google charting tools?

Google chart

Where things get really interesting, though, is when you start letting the data flow around…

So for example, if you publish the spreadsheet you can liberate the document in a variety of formats:

As well publishing the spreadsheet as an HTML page that anyone can see (and that is pulling data from the WIkipedia page, remember), you can also get access to an RSS feed of the data – and a host of other data formats:

See the “More publishing options” link? Lurvely :-)

Let’s have a bit of CSV goodness:

Why CSV? Here’s why:

Lurvely… :-)

Unfortunately, the *’s in the element names mess things up a bit, so let’s rename them (don’t forget to dump the original row of the feed (alternatively, tweak the CSV URL so it starts with row 2); we might as well create a proper RSS feed too, by making sure we at least have a title and description element in there:

Make the description a little more palatable using a regular expression to rewrite the description element, and work some magic with the location extractor block (see how it finds the lat/long co-ordinates, and adds them to each item?;-):

Lurvely…

And to top it all off:

And for the encore? Grab the KML feed out of the pipe:

…and shove it in a Google map:

So to recap, we have scraped some data from a wikipedia page into a Google spreadsheet using the =importHTML formula, published a handful of rows from the table as CSV, consumed the CSV in a Yahoo pipe and created a geocoded KML feed from it, and then displayed it in a YahooGoogle map.

Kewel :-)

PS If you “own” the web page that a table appears on, there is actually quote a lot you can do to either visualise it, or make it ‘interactive’, with very little effort – see Progressive Enhancement – Some Examples and HTML Tables and the Data Web for more details…

PPS for a version of this post in German, see: http://plerzelwupp.pl.funpic.de/wikitabellen_in_googlemaps/. (Please post a linkback if you’ve translated this post into any other languages :-)

PPPS this is neat – geocoding in Google spreadsheets itself: Geocoding by Google Spreadsheets.

139 Responses to “Data Scraping Wikipedia with Google Spreadsheets”


  1. 1 redbaiters October 14, 2008 at 10:46 pm

    My liege,

    Wow, I think I saw what I think I saw. We really have to start talking about visualizing data in a WPMu/MediaWiki environment. I will be blogging some semantic stuff I really don’t understand shortly, because you inspire me to no end.

  2. 2 Brian October 14, 2008 at 11:08 pm

    Oh my goodness, I’m experiencing this familiar “Tony just did something I only half-get but am totally excited about” sensation…

    I gotta get me to a remedial digital literacy class, like, yesterday.

  3. 3 Mike October 15, 2008 at 8:03 am

    tony. good shit. mike

  4. 4 Gaurang October 15, 2008 at 10:42 am

    This is really awesome, now the only thing left to do is use http://f33d.in and get updates via twitter :)

  5. 5 tycho October 15, 2008 at 10:52 am

    This is amazingly cool; thank you!

  6. 6 Tom October 15, 2008 at 11:37 am

    That made my day.

    I wish I’d have realized that was possible when I did this with Jim Groom. It would have saved me a lot of time and it’s just such an elegant way to do it.

    Thanks.

    Tom

  7. 7 Joss Winn October 15, 2008 at 12:43 pm

    Brilliant and really transparent. Following your example, now anyone could do this if they wanted to (as I have done). Thanks.

  8. 8 stephanea October 15, 2008 at 12:53 pm

    Thank you for sharing. That will be very usefull.

  9. 10 Matty Fo October 15, 2008 at 1:36 pm

    Amazing mashup. More importantly the =import function in Google spreadsheets is an amazing tool for building data sets for visualization and mashup projects, thanks for some killer knowledge.

  10. 11 Gary October 15, 2008 at 1:39 pm

    Been reading your blog for a while and this just made me shake my head… am amazed and do agree with Brian thanks!!!

  11. 12 Peter T - Webshop October 15, 2008 at 1:45 pm

    Very smart, a great example of using previously disjointed assets and creating a new way of collecting and sharing data. Thank you.

  12. 13 Katrineholm Uncensored October 15, 2008 at 3:03 pm

    Tony, you did a fine thing indeed! I’d like to see, for example, more transparency in our local government here in Katrineholm, Sweden where all documents (invoices included) with the exception of those involving personal privacy are online. Now, you offer interesting possibilities for information gathering and online publication via Wikipedia and then with Google Docs.

    Really, you did an excellent job Tony. Most impressive.

  13. 14 Juan Pablo Pincheira October 15, 2008 at 3:11 pm

    This is really cool, thanks for sharing! :-)

  14. 15 Jon October 15, 2008 at 3:16 pm

    Awesomely nifty! Really good stuff.

  15. 16 Marius Hanganu October 15, 2008 at 3:57 pm

    Really cool stuff. This is what web 3.0 should be all about. Integrating services.

    The disadvantage in having so many pieces is that you have little control over how the flow should update the end results. If you want to publish data real time, you’d want to see it as fast as possible on google/yahoo maps. This is where some more work is needed.

  16. 17 jq October 15, 2008 at 4:12 pm

    Well done. Just a minor correction. This:

    “consumed the CSV in a Yahoo pipe and created a geocoded KML feed from it, and then displayed it in a Yahoo map.”

    should be:

    “consumed the CSV in a Yahoo pipe and created a geocoded KML feed from it, and then displayed it in a GOOGLE map.”

    Or Yahoo then Google map.

  17. 18 Jean-Guy October 15, 2008 at 4:45 pm

    Tried duplicating your example but for some reason i always get a parse error ?

  18. 19 chrisco October 15, 2008 at 4:55 pm

    Very cool, thanks for sharing :)

    Question/request:

    I would like to do something like that with data from Yahoo Finance (and/or other source) for end of day market data (indexes, trading volume, advancing and declining issues, new highs/lows, etc.).

    Anyone can/care to help me with that, please get in touch through the contact form on my blog (linked to in this comment, I think, but if not go to http://www.chrisco.us). Would be a paid freelance project, of course.

    Thanks,
    Chris

  19. 20 Edward Ho October 15, 2008 at 6:22 pm

    That’s some really excellent work. Great job on the write up as well.

  20. 22 learningspanish October 15, 2008 at 8:53 pm

    Wow… I think my head just exploded….

    That’s impressive. I didn’t realize you could pull data in that way from a webpage in Google Spreadsheets. Nice implementation to get it to Google Maps.

  21. 23 stacey October 15, 2008 at 10:21 pm

    how do you make sure the data you are using stays current? for example, if i took it from another website and put it on a map.

  22. 24 Tony Hirst October 15, 2008 at 10:48 pm

    First of all – thanks for all the comments folks – and apologies for the delay in replying; I’ve been offline traveling to and from – and delivering – a presentation on mashups today… :-)

    @katrineholm Uncensored – does your council have any web pages we could start to scrape?

    @chrisco – I’ll be posting a follow up post in an hour or so that may be of interest?

    @jq ta :-)

    @stacey – the google spreadsheet polls the original HTML page every so often, so updates to the page will update the data in the spreadsheet. (Of course, if the structure of the page is changed – eg by adding or removing another table in the page before the table you want to pull out – the import function will try to import the now incorrect table; screenscraping based mashups like this are always quite brittle in that a change in the page can break the way you try to get data out of it; which is why APIs are better…:-)
    The Yahoo pipe will also cache the data being pulled from the spreadsheet in the csv file.
    So whilst in an ideal world, a change to the wikipedia page would immediatley be reflected each time you refreshed the map, there is “cacheing induced latency” in the google spreadsheet and the Yahoo pipe which could mean the data is potentially minutes or hours out of date… but hopefully no more than that…

  23. 25 Greg October 15, 2008 at 10:51 pm

    As I understand it there are two ways: when you want to update something in google spreadsheets you have to click on the cell with the calculation and press enter, so that it refreshes, or use a cron job to run a script that fiddles with the spreadsheet so that it updates/grabs fresh data (don’t ask me if/how that would work)

  24. 26 Jon October 15, 2008 at 11:50 pm

    (I just posted this on Brian’s blog, but thought I might as well post it here, given the plaudits that this is attracting…)

    Of course, superficially this is fine. But if you start to look at the data, it’s rather problematic. I was surprised to see that Manchester was listed as smaller than Bristol, of all places. It was only by going to the Wikipedia page that I understood the way in which the statistics were gathered.

    In short, a lot of information has been stripped out in this process. I don’t see much in the way of gain. And in fact, it’s become very misleading.

    Where’s the upside?

    OK, call me a cynic, but…

  25. 27 Jon October 15, 2008 at 11:57 pm

    Or to put it another way (and in case I didn’t sound grumpy and fuddy-duddy enough the first time), the result is a USA Today style graphic, with all the apparent virtues of clarity but the real vices of decontextualized data masquerading as information. Worse, it’s framed as a multiple-choice “pop test” as though it had some kind of educational value. It doesn’t. It’s miseducation.

  26. 28 Tony Hirst October 16, 2008 at 12:17 am

    @Jon the intention of the post wasn’t to produce an infographic, it was to document a mashup recipe. I was looking for a data table with a list of locations that I could scrape using a Google spreadsheet, then output into a geocoding Yahoo pipe and the wikipedia page was the first thing I came across that fitted the bill…

    Your comment makes an interesting point about the way data gets used, abused and inappropriately visualised, though…

  27. 29 Jon October 16, 2008 at 12:24 am

    Well, your mashup recipe provides an excellent case in point of the ways in which data gets used, abused, and inappropriately visualized.

    The problem is not with the Wikipedia page, but with the way in which the information it contains has been repackaged.

    Again, in devil’s advocate role: where’s the upside here?

  28. 30 Tony Hirst October 16, 2008 at 12:41 am

    @Jon “The problem is not with the Wikipedia page, but with the way in which the information it contains has been repackaged.”

    Agreed – as I just posted in a reply on Brian’s Abject blog, I should have explicitly stated the provenance of the data in both the spreadsheet and each description element of the feed generated in the pipe.

    Re: the visualisation being a crap one, my original intention was to visualise the actual population data in some way (I thought about something like this: http://ouseful.open.ac.uk/blogarchive/013580.html then thought better of it), but in the end I ran out of time… As I have done now – 1.40am and it’s been way too long a day to argue any more… ;-)

  29. 31 nzhappygirl October 16, 2008 at 7:31 am

    I second Brian:digital remedial literacy class for me too, because that looks seriously cool.

  30. 32 Chloe October 16, 2008 at 8:54 am

    It would be neat to export the chart or map as an image, and import it into the original Wiki page, so that whenever someone updates the table on the Wiki, the map & chart are automatically updated too. I’m sure there are services which will take a screenshot of a web page. I don’t know how to export the chart, or if Wikipedia allows images to be on another site.

  31. 33 Tony Hirst October 16, 2008 at 9:49 am

    @chloe I think you just found another step in the chain – there is a Google maps static image API ( http://code.google.com/apis/maps/documentation/staticmaps/ ) that lets you craft a URL for a static image (gif, i think?) for a map that can be embedded in a page in the normal way (“img src=” etc)…

    No time to try it now myself – but if the lazyweb fancies having a go, please post a link back here :-)

  32. 34 Sarsen56 October 16, 2008 at 11:10 am

    I think the phrase is ‘well impressed’. The bad news is I just wasted two days doing some tedious longhand stuff! Thanks for this post.

  33. 35 portrait111 October 16, 2008 at 11:17 am

    Hamarja,aisob ki lekhcos?kicui to buzte parlam na!!

  34. 36 utellit October 16, 2008 at 2:22 pm

    I love this. Learning is so fun! Thanks:)

  35. 37 christophermar October 16, 2008 at 2:24 pm

    Very cool mashup. Thanks for the great writeup.

  36. 38 dave October 16, 2008 at 3:44 pm

    I think it’s worth pointing out that while Google’s importHTML is very cool, for many purposes selecting, cutting and pasting a table from a website into a spreadsheet (Excel, OOCalc, or Google Spreedsheets) in the standard manner will the do the job.

    While less gee-whizz, it’s a bit more approachable, and a strangley overlooked ability. (There are some odd corner cases, particularly if you use Microsoft products, but simply knowing it’s possible usually means you’ll get there in the end. I find Fireox plus OpenOffice gives the best results and most powerful results. A quick Google suggests there’s a Firefox extension for people who find themselves needing to do more)

  37. 41 Jon October 16, 2008 at 4:01 pm

    Tony, I’m sorry you see this as arguing.

    Again, however, it would be great if someone, anyone, could tell me the upside.

  38. 42 Tony Hirst October 16, 2008 at 4:17 pm

    @jon i see it as an argument in the sense of – you know – a discussion with different positions that is capable of raising different perspectives and moving the, err,argument on…;-) And the upside is that it was just a sort of coherent way of demonstrating lots of separate techie mashup tricks…

    @dave the problem with cur and paste is that if the original table changes, your cut and pasted data is obsolete (unless the cut and paste cuts’n'pastes a reference to the table that is then imported behind the scenes). A comment on the follow-up post to this one – http://ouseful.wordpress.com/2008/10/16/visualising-financial-data-in-a-google-spreadsheet-motion-chart/ – attempts to work out just how frequently the spreadsheet does update the data.

  39. 43 Brian October 16, 2008 at 5:16 pm

    Tony – I’m glad you mentioned that you had posted a comment on my blog – I had yours and three or four others inexplicably in a junk folder, which I hadn’t checked in weeks.

    Jon – I took a crack at describing what I see as the upside in response to your comments on my blog, and linked to an article I wrote last year… I suspect it will fail to convince you, but I hope it will give a sense why this is interesting to some of us.

  40. 44 mdg October 16, 2008 at 6:03 pm

    -This could also be useful link http://www.trendrr.com
    either port data out or into it

  41. 45 Jon October 16, 2008 at 6:46 pm

    Well, once you can tell me what’s the upside of these “techie mashup tricks,” then that would indeed be “kewel.”

    It would be marvellous if the argument could move on, but it doesn’t seem to me that you are willing to look at the issues here.

    It’s shiny, shiny. And worse than useless: it’s positively unhelpful.

    No wonder people’s heads are exploding. It helps if you don’t actually think about what’s going on.

  42. 46 Ivan October 16, 2008 at 6:47 pm

    Jaw-Dropping.

    It’s almost like this web of interconnected services will actually happen after all…

  43. 47 Raj Kumar October 16, 2008 at 7:54 pm

    hey so much confusion can u clarify please?

  44. 48 Tony Hirst October 16, 2008 at 9:08 pm

    @Jon “It would be marvellous if the argument could move on, but it doesn’t seem to me that you are willing to look at the issues here.”

    I’ve so lost the thread now… anywhere, here are a few answers (hopefully) of a sort:

    1) no provenance – agreed; I touched on that in a previous comment and on Brian’s blog; at a more substantive level, on the point that mashups strip provenance info and context away from data – agreed: we do have a “metadata maintenance” issue here. I’ll come back to this in a moment to suggest that there may be situations where that’s okay… (err, thin ice up ahead, I suspect ;-) But generally, yes – agreed: it’s an issue; only not here, not now. This is where we try to enthuse people – then we can get on to the issues…

    To pull a quote of yours from the comment thread on http://weblogs.elearning.ubc.ca/brian/archives/048641.php “Um, what is the point, then? To do something that’s “kewel”? All surface and no substance? All process and technical trickery and no actual thought or context?”

    2) Part of the point was to just document for myself some whizzy techie wiring for a workshop session I ran yesterday, and rehearse a demo to finish off the session. I’ve sat through that many dull cubed presentations, I’m trying to develop my own presentation skills (a long hard slog if you’ve ever suffered one of my presentations) and figured for the w/s I needed to try and finish with some buzz in the room. Shiny shiny…. but that’s okay, because no-one remembers anything from a presentation, just whether they came away happy, dazed, bored or enthused.

    3) Value generating and information displaying enterprise mashups are arguably on the way (e.g. http://www.forrester.com/Research/Document/Excerpt/0,7211,44213,00.html ); and one reason (IMHO) for this is that the barriers to entry in terms of usability are being lowered by the experimentation that’s going on in the public and beta web. Unfortunately, most people have no idea about what the hell a mashup is – not even the very idea; (this R/W/W post provides one way of trying to explain what mashups are about in an accessible-ish way: http://www.readwriteweb.com/archives/forrester_enterprise_mashups.php ) So the demo was in part to try to get across some the relative easy (and getting easier) process techniques available for putting these things together (pipelines, data transformation, etc), and leave me with an opportunity to say: “so that’s a mashup” at the end of the session.

    I didn’t touch on stripping away the context and maintaining provenance info in the post because that would have BORING and is best saved for a later day – when people are at least thinking about what the consequences of mashups are having seen one possible example.

    @utellit told it as they saw it: “I love this. Learning is so fun! Thanks:)” Good – yes – learning should be fun; rarely is; my view? get enthused and then do the hard work of really getting to grips with it, because you’ll be more motivated…

    4) Most people are turned off by techie stuff, or techie stuff being “too hard” – how many people do you know who don’t have a concept of right-click, or it’s benefits? So I tried to tell a story that embedded some how to’s, and if you abstract away from it and start to think about what’s going on, there is some deeper stuff in there; like what sort of processing is going on in a ‘mashup’, and what the risks are…

    5) Thin ice here: in an enterprise, where the data is hopefully quality data(?!), you can hopefully trust it; and if you learn how to mash it up properly, you can get value from it. The provenance is assumed – there is still a risk wrt loss of context though. Related story – you see lots of web stats reporting averages (averages? wtf? MEANS…); and this are often a total waste of time – the info is in the distribution, and the distribution is often not normal… Context gets dropped from data all the time… and the info processing that goes on is oftentimes lossy… but that’s not a problem if the info doesn’t mislead you into making a bad decision…

    6) as we generate more and more data, we’re gonna have to start learning about visualisation. Many people don;t get the idea of uncovering relationships in data through visualising it; we have to start somewhere with the visualisation education programme; maps are good for that… we have to start somewhere… and if the visualisation is a bad one, in viz 101 that’s maybe not so bad because the idea is to get across the idea of doing visualisations at all. I remember physics at school: each year we were told to forget what we’d done the previous year because it was just an approximation to the truth – and we’d be doing the truth this year…

    And finally, as I understand “kewel”, it’s a way of making fun of (people who use) the word cool who are maybe not the sort of people who can get away with saying cool… so I said “kewel” to suggest that I suspected people would get off on the shiny, shiny of the post… which is ok, because the path to enlightenment is a long one, and it’s important to enthuse people about this stuff in the beginning…

    Phew – too long a comment – If the above doesn’t address any of your points, could you try expressing them in a different way, ‘cos I’m obviously not understanding them…

  45. 49 Chris McAvoy October 17, 2008 at 12:15 am

    Tony, got your comment on my blog. Thanks for the inspiration. I ended up using Pipes to clean up a feed of market information for the game Eve Online, fed it to a Google Spreadsheet, and use it to calculate profit margins on the production of pretend ships and pretend guns.

    Here’s my pipe and here’s a snapshot of the spreadsheet it feeds.

    Not really a *real* application, but its certainly useful within the pretend world of Eve.

    • 50 Aundreaus Fortiz July 30, 2009 at 8:42 pm

      Chris: I love your pipe, not sure how to use it yet though. I started today thinking I would find a way to get up-to date prices for materials needed for POS operation.

      I think where i’m stuck on is where to find the typeID of the item i’m looking for. My spread sheet is set up at the moment but not sure how to pull the info in.

  46. 51 Jon October 17, 2008 at 8:36 am

    Tony, I think that the basic issue is that what you regard as “BORING” seems to me to be rather fundamental. Indeed, it seems rather clearly to be a way of avoiding asking the question of “the consequences of mashups.”

    Yes, you’ve given an example. And I’m now asking about the consequences of this process. I think that, certainly for this example, they are uniformly negative. Which is why I was asking if there were any upside to this representation of a partial, decontextualized dataset in a new environment. Again, I still see no advantages, while I see many, many disadvantages.

    No doubt you’ll think I’m exaggerating, but I think that this post, and the reaction to it, is profoundly discouraging. I’m sorry to be the drag on the “kewel” party by talking about what’s so damn “BORING.” But it seems to me to be important. I’ve written up some further thoughts here.

  47. 52 Jon October 17, 2008 at 1:40 pm

    And perhaps finally, at least for now, a quick response to your points here…

    1) I think the issue is about more than provenance, though providing provenance would certainly be a leap forward. We’re agreed on that. We also agreed that at least some mashups can “strip provenance info and context away from data.” But I should make it clear that I’m not against mashups per se. Some can definitely enhance rather than degrade both data and understanding. But I do think that this question (as to whether there is enhancement or degradation, or more generally the effects of the various transformations undergone in a mashup process) needs to be front and central at all times. It would seem that that’s our basic disagreement: you want to postpone that until some unspecified later date; for now it’s “BORING” and beside the point. Again, I strongly disagree. Postponing such questions is (as I say on my blog) a textbook case of data illiteracy. Moreover, I’m disappointed that enthusiasm is allowed to overwhelm critique so easily.

    2) I like interesting presentations, too. I disagree that thinking critically is necessarily boring. And I don’t like presentations that are whizzy but lack substance.

    3) I’d like to believe that “value generating and information displaying enterprise mashups are arguably on the way.” I think, however, that the links you provided are rather worrying examples of quite the opposite trend: “there is plenty of money to be made selling mashup platforms. . . . Those vendor strategists that move quickly, plan a mashup strategy, and build a partner ecosystem will come out on top”. Ugh. That’s certainly a different definition of “value” than the one I would hope might come out of such tools. Ugh. And learning should be fun. But all that is fun is not learning.

    4) Techie stuff can certainly be bewildering. It makes some people’s heads explode. Hence the vital importance of critical data literacy, precisely so that people “start to think about what’s going on, there is some deeper stuff in there; like what sort of processing is going on in a ‘mashup’, and what the risks are.” I’m all for this.

    5) See above for my fear as to what you may mean by getting “value” from data. Meanwhile, my more basic point is that it’s not so much a question of having quality data or not: I’m happy to accept that for this purposes the Wikipedia data is of sufficient quality. What’s at stake is the ability to interpret that data. Again, that’s what I consider to be the realm of data literacy. I’m sure that some mashups, and some transformations, aid in such interpretation. Others do not. But figuring out which is which is, again, vital. It can’t be postponed.

    6) Finally, I’m all for a “visualisation education programme.” But again, my comparison with the USA Today visualizations is to point out that such education needs to have critique at its core. Not necessarily enthusiasm. Or rather, surely the point is to encourage enthusiasm for critique. That’s what I see my job, at least, as being about. More broadly, as I say, it’s what I see as being the role of the university.

  48. 53 Jon October 17, 2008 at 1:43 pm

    Ach, I just wrote a long comment in response to your six points. It seems to have got lost somewhere. Oh well.

  49. 54 Tony Hirst October 17, 2008 at 1:55 pm

    @jon it was being held as spam for some reason (not my doing… not sure I understand how WordPress makes its spam detection decisions?); should be in the comment stream now…

  50. 55 Jon October 17, 2008 at 4:37 pm

    Tony, no probs. No conspiracy theorizing here!

  51. 57 Maniquí October 17, 2008 at 10:57 pm

    When I was kid, I did this kind of maps when I was at school, by using a table on some schollar book and a map. Who didn’t? What’s new? The auto-magical update through the pipes?

    From the linked page at Forrester:
    “The Mashup Opportunity.
    How To Make Money In The Evolving Mashup Ecosystem.
    Download and print PDF immediately. Price: US $775″

    They will certainly make money if the sell an e-book at that price. :|

    It seems the Web is becoming more and more a cash register, an entertainement place, than a place to share documents and scientific data. Nothing wrong with that, my job at work is creating commercial/entertaining sites for small/medium business.

  52. 58 Dan Murray October 18, 2008 at 12:20 pm

    Really interesting. I”m looking for ways to integrate public data into my companies’ data warehouse. Thanks you for posting this.

  53. 59 playoutintelligence October 19, 2008 at 3:47 am

    BTW: you can do that with MS Excel since Office XP (I think). Data->Import External Data->New Web Query. Click on one or more of the arrows to switch it to a green checkmark and press the “Import” button.

    Works even with password protected websites, Excel will ask you for your login data when you open the spreadsheet then.

  54. 60 Mano Marks October 29, 2008 at 12:37 am

    So. Awesome.

    Do you mind if I show this off to people?

  55. 61 pop October 29, 2008 at 3:06 am

    If I have name and address only(Japan Language) , I would like to show location how can do. I try to do many an hour but Y Location not appears.
    Plz Help Me –

  56. 62 plerzelwupp October 29, 2008 at 6:42 pm

    Really great.
    But some cities aren’t on the right place(?)

    Same problem to me with this Article:
    http://de.wikipedia.org/wiki/Liste_der_gr%C3%B6%C3%9Ften_St%C3%A4dte_der_EU
    and the result:
    http://pipes.yahoo.com/pipes/pipe.info?_id=Msa1t9Ol3RGccWE1rbQIDg

    seem, that the missing cities lay over the last found city

    Does anyone have an idea to fix this problem?

    Many greetings
    plerzel ;-)

  57. 63 Esa October 31, 2008 at 10:04 pm

    Stunning!

    You made me to study geocoding by Spreadsheets.

    Yes, you can geocode on Spreadsheets
    http://apitricks.blogspot.com/2008/10/geocoding-by-google-spreadsheets.html

    Still you need Pipes for KML. Funny.

  58. 64 plerzelwupp November 1, 2008 at 2:42 pm

    Thank you ESA – Great
    pipes find all cities on the right places :-))

  59. 65 DoctorDave November 1, 2008 at 3:59 pm

    Has anyone figured out how to use a published Google spreadsheet as an external data source for Excel?

  60. 67 plerzelwupp November 1, 2008 at 9:04 pm

    thanks again. I wrote a german article in my blog about that.
    For those who are interested…..

  61. 68 plerzelwupp November 1, 2008 at 9:05 pm

    thanks again. I wrote a german article in my blog about that.
    For those who are interested…..
    http://plerzelwupp.pl.funpic.de/wikitabellen_in_googlemaps/

  62. 69 Esa November 1, 2008 at 9:54 pm

    Great, plerzelwupp

    I can see that you geocode by Spreadsheets but do the KML conversion by Pipes.

    MANO!

  63. 70 plerzelwupp November 2, 2008 at 8:10 pm

    thank you, ESA – I mixed your proposal (geocode by google) with this one (kml by pipes) …. the pipes location extractor works better, when you feed it with longitude and latitude. Google nearly found every of those 52 cities (except of four). The last 4 cities where found, as I feeded google with city and country (not only city).

    Thanks again, Tony and Esa – and many greetings from germany

  64. 71 pop November 3, 2008 at 3:54 am

    Help ME PLZ, I do not known why map not shown , Please recommend Plz !!!!
    Thx so lots

    Pips has problem
    http://pipes.yahoo.com/pipes/pipe.info?_id=5rBrVlqp3RGmRzxlbbsjiw

  65. 72 pop November 3, 2008 at 4:04 am

    I confuse a thing , My pipe show Lat and long ,but when I run Pipe the map not appear , just show list.
    How can I fix this problem.

    Plz Help me, (:

    My pipe which has problem
    http://pipes.yahoo.com/pipes/pipe.info?_id=5rBrVlqp3RGmRzxlbbsjiw

  66. 74 Richard Fahey November 8, 2008 at 9:40 pm

    Thanks so much for this awesome post. The only issue I have is that the wiki I want to scrape data from is behind a firewall and cannot be accessed without signing in. I wonder is there anyway out of this because it’s RSS feeds etc cannot be accessed externally. I’m looking at some mediawiki extension that might push the wiki page data externally so I can then use google spreadsheet, yahoo pipes etc. Any ideas?

  67. 75 Tony Hirst November 9, 2008 at 12:52 am

    @richard
    It won’t be approved of, but if you have a DMZ server that sits on the internal network (ie can see internal URLs) and also faces outside, you can set up a simple proxy that will relay the feed into the public domain…

    Of course, I can’t recommend such practice ;-)

    If the login to the wiki is via http, then you may be able to pass user id and pwd in via URL. That is, if you can get a feed out using something like http://example.com/feed.xml?id=myID&password=secret

  68. 76 Pamela Fox November 27, 2008 at 12:21 pm

    Great post! This inspired me to write a blog post about my personal technique for geocoding spreadsheets data:

    http://otherfancystuff.blogspot.com/2008/11/geocoding-with-google-spreadsheets-and.html

  69. 77 Denis February 10, 2009 at 1:03 pm

    http://www.alpheta.com – please take a look if you’re interesed in data scraping

  70. 78 ernie espinoza February 13, 2009 at 11:33 pm

    how do you copy and paste the url in to the query ()in google doc’s. it wont allow me too

  71. 79 mike February 21, 2009 at 9:04 pm

    Can you pass pwd and user ID in url?

  72. 80 Shahriar Hyder April 16, 2009 at 2:30 pm

    Here is a post regarding techniques for ‘Scraping your way to RSS feeds’ albeit in a non-programmatic (layman) way:

    http://technosiastic.wordpress.com/2009/04/08/scraping-your-way-to-rss-feeds/

  73. 81 Olga Revilla June 11, 2009 at 9:32 am

    Hi,

    I have tried to do the first step (importHTML) in Google Docs and it does not work with comma (,) separator, but changed to semi-colon (;) and worked fine.

    Great article, congratulations!

  74. 82 Edward Metz August 27, 2009 at 4:56 am

    I wanted to ask if someone out there might be able to demonstrate how I could introduce into my map a third data point to display. Right now I have the location and name displaying but also need the date if at all possible. Any ideas?

  75. 83 Stephen Akins November 17, 2009 at 3:59 am

    Well done!

    I wrote a blog post doing something related; how to create a server monitor using a Google Spreadsheet and a little PHP.

    http://stephenakins.blogspot.com/2009/04/google-docs-server-monitoring_8546.html

    I have some ideas about how to create some other stuff using approaches like these.


  1. 1 Wikipedia : Google Docs : Yahoo Pipes Mashup - ../learninglab/joss/ Trackback on October 15, 2008 at 12:33 pm
  2. 2 Mashing up the PLE (Tony Hirst) « Doug Clow’s Imaginatively-Titled Blog Trackback on October 15, 2008 at 2:36 pm
  3. 3 Infovore » links for October 15th Trackback on October 16, 2008 at 12:00 am
  4. 4 Visualising Financial Data In a Google Spreadsheet Motion Chart « OUseful.Info, the blog… Trackback on October 16, 2008 at 12:05 am
  5. 5 The NAvAP » Blog Archive » Example of a web mashup at work Trackback on October 16, 2008 at 12:12 am
  6. 6 links for 2008-10-15 « Simply… A User Trackback on October 16, 2008 at 12:33 am
  7. 7 datatube « Greg Is Overcritical Of Himself And Lots Of Other Things. Trackback on October 16, 2008 at 12:55 am
  8. 8 On est une bande de potes, on se scrape la gueule | taggle.org Trackback on October 16, 2008 at 7:52 am
  9. 9 The Great Geek Manual » Link Round-Up: October 15, 2008 Trackback on October 16, 2008 at 2:09 pm
  10. 10 The Wikipedia Blog » Blog Archive » Scraping Wikipedia tables with Google Spreadsheets Trackback on October 16, 2008 at 4:09 pm
  11. 11 =GoogleLookup: Creating a Google Fact Engine Directory « OUseful.Info, the blog… Trackback on October 16, 2008 at 11:51 pm
  12. 12 » From Wikipedia to Google Mashup mapkist Trackback on October 17, 2008 at 2:30 am
  13. 13 Weekly linkdump #147 - max - блог разработчиков Trackback on October 17, 2008 at 9:17 am
  14. 14 Brokekid.net » Blog Archive » Tips & Tricks for Google Apps Trackback on October 17, 2008 at 4:25 pm
  15. 15 Brokekid.net » Blog Archive » Monkey Vs Robot - My Weekly Idealist post Trackback on October 17, 2008 at 5:06 pm
  16. 16 Nodalities » Blog Archive » The Revolution Starts (near) Here Trackback on October 17, 2008 at 11:33 pm
  17. 17 Media Literacy: Making Sense Of New Technologies And Media by George Siemens - Oct 18 08 | Geekblog.biz Trackback on October 18, 2008 at 9:26 am
  18. 18 No coding required? « Red56 blog Trackback on October 18, 2008 at 10:30 am
  19. 19 triple|scape » Blog Archive » TWILD for October 18, 2008 Trackback on October 18, 2008 at 8:13 pm
  20. 20 Daily Del.icio.us Trackback on October 19, 2008 at 4:00 am
  21. 21 Zaubern mit Google Documents | der leumund. Trackback on October 20, 2008 at 11:55 am
  22. 22 Data Scraping Wikipedia with Google Spreadsheets « OUseful.Info, the blog… | clhmedia.com Trackback on October 20, 2008 at 11:58 am
  23. 23 Manipolare i dati sul Web at Anto’stuff Trackback on October 22, 2008 at 9:01 am
  24. 24 Referrer Traffic from Amazon - WTF?! « OUseful.Info, the blog… Trackback on October 22, 2008 at 10:02 am
  25. 25 :: Tom Smith’s: theOTHERblog Trackback on October 24, 2008 at 11:02 am
  26. 26 Terror in the Heartland » ma.gnolia linkdump Trackback on October 26, 2008 at 7:08 am
  27. 27 ollyjackson.co.uk - fine quality hypertexts since 1998 Trackback on October 26, 2008 at 3:54 pm
  28. 28 ollyjackson.co.uk - fine quality hypertexts since 1998 Trackback on October 26, 2008 at 3:55 pm
  29. 29 Data Scraping: Using Google Docs to grab table data « Everyday Scripting Trackback on October 28, 2008 at 8:06 am
  30. 30 » Blog Archive » links for 2008-10-23 Trackback on October 29, 2008 at 11:26 pm
  31. 31 links for 2008-10-30 - schorleblog Trackback on October 30, 2008 at 7:40 pm
  32. 32 rascunho » Blog Archive » links for 2008-10-30 Trackback on October 30, 2008 at 11:22 pm
  33. 33 Data Scraping Wikipedia « LocalLab : Foire aux Infos Trackback on November 2, 2008 at 1:27 pm
  34. 34 Scraping avec Google SpreadSheets | taggle.org Trackback on November 5, 2008 at 9:41 am
  35. 35 October 2008’s most-clicked posts to the link list: Data scraping, design experts, copy editors, killing the article, local. - Joe Think Trackback on November 9, 2008 at 10:51 pm
  36. 36 Data Sets - A huge list | Austin Rude Trackback on November 12, 2008 at 9:22 am
  37. 37 Google MyMaps Now With RSS (= Easy Geoblogging) « OUseful.Info, the blog… Trackback on November 17, 2008 at 11:04 am
  38. 38 Is Scraping Data Wrong? « Alexander The Great Trackback on November 18, 2008 at 5:24 am
  39. 39 Approxi-mapping Mash-ups, with a Google MyMaps Tidy Up to Follow « OUseful.Info, the blog… Trackback on November 21, 2008 at 9:24 am
  40. 40 Abject Learning » Blog Archive » Is there a remedial data literacy class I should be signing up for? Trackback on November 24, 2008 at 7:45 pm
  41. 41 My CETIS 2008 Presentations « OUseful.Info, the blog… Trackback on November 27, 2008 at 12:48 am
  42. 42 Discover “What a Group Are Saying” with Yahoo Pipes and GoogleDocs :: Tom Smith’s: theOTHERblog Trackback on January 6, 2009 at 1:10 pm
  43. 43 Data Scraping at Research Pipeline Blog Trackback on January 18, 2009 at 5:54 am
  44. 44 Wikipedia-Tabellen in Google Spreadsheet importieren und in Google-maps darstellen - “, jede x-beliebige Tabelle, =importhtmlQuelladresse;table;n , , , Hinweis die Formel sollte eingetippt werden - wobei Sie die Wikipedia-Adresse einfügen können Wenn Trackback on January 31, 2009 at 2:59 pm
  45. 45   Cool google and wikipedia mashups by andydickinson.net Trackback on February 17, 2009 at 11:03 am
  46. 46 Mastering RSS: Control your Inputs, and Improve your Outputs « Rick Martin Trackback on February 23, 2009 at 9:09 am
  47. 47 News update « Daily Link Trackback on April 29, 2009 at 6:10 am
  48. 48 Mashing Wikipedia Data with Google Spreadsheets | Workplace Learning Today Trackback on May 4, 2009 at 1:00 pm
  49. 49 Horse Racing Data - Apuestas Forobet: Foro de apuestas deportivas Trackback on May 12, 2009 at 9:03 am
  50. 50 Tagz | "Data Scraping Wikipedia with Google Spreadsheets « OUseful.Info, the blog…" | Comments Trackback on May 16, 2009 at 5:10 pm
  51. 51 10 great Google hacks | www.amusingourselves.com Trackback on May 19, 2009 at 2:50 pm
  52. 52 Ten top Google hacks «10brilliant Trackback on May 19, 2009 at 8:38 pm
  53. 53 Initial Thoughts on “Mashup Patterns” « OUseful.Info, the blog… Trackback on June 11, 2009 at 11:38 am
  54. 54 Scraping, scripting, hacking « electronic museum Trackback on July 7, 2009 at 11:54 am
  55. 55 Online utilities for the non-hacker « just a dumping ground for now Trackback on July 28, 2009 at 3:22 pm
  56. 56 Blind Ape Seo » How to scrape without programming Trackback on July 29, 2009 at 2:13 pm

Leave a Reply




TweetMeme Chicklet

Custom Search Engines

How Do I? Instructional Video Metasearch Engine
OUseful web properties search

OUseful feedthru bookmarks...

Pages

 

October 2008
M T W T F S S
« Sep   Nov »
 12345
6789101112
13141516171819
20212223242526
2728293031