Guardian Data Store – free data, and some ideas on how to play with it
I was reading the Guardian (a UK newspaper) online today and saw that they have just launched something called Open Platform, basically a set of tools that allow you to access and build applications on top of their data and content. The thing that really caught my eye was the Data Store, which makes available all of the numeric data they would usually publish in tables and graphs in the paper in Google Spreadsheet format. Being a data guy I find free, interesting data irresistible: I work with data all day long, and building systems to help other people analyse data is what I do for a living, but usually I’m not that interested in analysing the data I work with myself because it’s just a company’s sales figures or something equally dull. However give me information on the best-selling singles of 2008 or crime stats for example, I start thinking of the fun stuff I could do with it. If you saw Donald Farmer’s fascinating presentation at PASS 2008 where he used data mining to analyse the Titantic passenger list to see if he could work out the rules governing who survived and who didn’t, you’ll know what I mean.
Given that all the data’s in Google Spreadsheets anyway, the first thing I thought of doing was using Panorama’s free pivot table gadget to analyse the data OLAP-style (incidentally, if you saw it when it first came out and thought it was a bit slow, like I did, take another look – it’s got a lot better in the last few months). Using the data I mentioned above on best-selling singles, here’s what I did to get the gadget working:
- Opened the link to the spreadsheet: http://spreadsheets.google.com/pub?key=phNtm3LmDZEP4i_r7RdVhUg
- Followed the link at the very bottom of the page to edit the page.
- On the new window, clicked File/Create a Copy on the menu to open yet another window, this time with a version of the data that can be edited (the previous window contained only read-only data)
- Right-clicked on column J and selected Insert 1 Right, to create a new column on the right-hand side.
- Added a column header, typed Count in the header row, and then filled the entire column with the value 1 by typing 1 into the first row and then dragging it down. I needed this column to create a new measure for the pivot table.
- Edited the ‘Artist(s)’ column to be named ‘Artist’ because apparently Panorama doesn’t like brackets
- Selected the whole data set (the range I used was Sheet1!B2:K102) and then went to Insert/Gadget and chose Analytics for Google Spreadsheets. It took me a moment to work out I had to scroll to the top of the sheet to see the Panorama dialog that appeared.
- Clicked Apply and Close, waited a few seconds while the cube was built, ignored the tutorial that started, spent a few minutes learning how to use the tool the hard way having ignored the tutorial, and bingo! I had my pivot table open. Here’s a screenshot showing the count of singles broken down by gender and country of origin.
Of course, this isn’t the only way you can analyse data in Google spreadsheets. Sisense Prism, which I reviewed here a few months ago, has a free version which can connect to Google spreadsheets and work with limited amounts of data. I still have it installed on my laptop, so I had a go connecting – it was pretty easy so I won’t go through the steps, although I didn’t work out how to get it to recognise the column headers as column headers and that polluted the data a bit. Here’s a screenshot of a dashboard I put together very quickly:
Lastly, having mentioned Donald Farmer’s Titanic demo I thought it would be good to do some data mining. The easiest way for me was obviously to use the Microsoft Excel data mining addin: there are two flavours of this: the version (available here) that needs to be able to connect to an instance of Analysis Services, and the version that can connect to an instance of Analysis Services in the cloud (available here; Jamie MacLennan and Brent Ozar’s blog entries on this are worth reading, and there’s even a limited web-based interface for it too). Here’s what I did:
- Installed the data mining addin, obviously
- In the copy of the spreadsheet, I clicked File/Export/.xls to export to Excel, then clicked Open
- In Excel, selected the data and on the Home tab on the ribbon clicked the Format as a Table button
- The Table Tools tab having appeared on the ribbon automatically, I then pressed the Analyze Key Influencers button
- In the dialog that appeared, I chose Genre from the dropdown to try to work out which of the other columns influenced the genre of the music
- Clicked I Agree and Do Not Remind Me Again on the Connecting to the Internet dialog
- Added a report comparing Pop to Rock
Here’s what I got out:
From this we can see very clearly that if you’re from the UK or under 25 you’re much more likely to be producing Pop, Groups are more likely to produce Rock, and various other interesting facts.
So, lots of fun certainly (at least for a data geek like me), but everything I’ve shown here is intended as a serious business tool. It’s not hard to imagine that, in a few years time when more and more data is available online through spreadsheets or cloud-based databases, we’ll be doing exactly what I’ve demonstrated here with that boring business data you and I have to deal with in our day jobs.