In all of the Gemini demos seen so far, a big thing is made of the fact that it can be used to integrate data from outside sources (e.g. some stuff you found on the web) with internal data (e.g. data from your data warehouse). This is all very well, but it assumes you have some way of actually capturing the data that's out there on the web in a usable form, and of automating the capture of new data from these sources on a regular basis.
For example, if you've found a table containing some data you want on a website, what do you need to do to actually use it? You'd need to copy the table and paste it into Excel, perhaps; you'd then need to do some reformatting and editing, copy it again, and paste it wherever you need it, such as Gemini. All very manual; in fact, a right pain, and not something you'd want to do on a regular basis. There might be an RSS or Atom feed, and you might be able to use a tool like Yahoo Pipes (see also Jamie Thomson's recent post), but there isn't always one, and even when there is it might not contain all the data you need or be in the right format.
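To make the copy-and-paste problem concrete: automating the capture of a web table means turning its HTML into rows and columns programmatically. A minimal sketch using only Python's standard library (the sample table here is made up for illustration; a real page would be fetched first):

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collects the cell text of every row in a <table>."""
    def __init__(self):
        super().__init__()
        self.rows = []       # completed rows
        self._row = None     # row currently being built
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

# A stand-in for a table you found on a website
html = """<table>
<tr><th>Product</th><th>Price</th></tr>
<tr><td>Widget</td><td>9.99</td></tr>
</table>"""

parser = TableParser()
parser.feed(html)
print(parser.rows)  # rows ready to paste into Excel, a database, or Gemini
```

Once the data is in this shape, re-running the capture on a schedule is trivial; that is essentially the gap tools like the one below fill, with a visual designer instead of hand-written parsing code.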
Last week I had a call with Kapow Technologies, and they’ve got a really cool tool that addresses this problem:
It’s something like a cross between a screenscraper and an ETL tool, and it allows you to harvest data from web pages, do some transformation and cleansing, and then move it somewhere else like a database table (where more traditional ETL tools can take over) or another application. There’s a bit of background on this type of technology here:
I got hold of an eval version and had a play with it, and while I'm hardly the right person to give an expert review, it does seem very powerful. It's certainly easy to use: I had a simple scenario working within twenty minutes, and I think anyone with some SSIS experience will find it fairly straightforward. First of all, you have to declare a set of objects (which can then be mapped onto relational tables) that will hold the data you want to collect; you then create a robot that will harvest data from web pages and load it into the objects; finally, you can run the robot either from the command line or on a schedule. More information can be found in the help:
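The object-then-robot workflow described above maps quite naturally onto a table and a load routine. This is just a sketch of the shape of the pipeline, not Kapow's actual API; the table and column names are mine:

```python
import sqlite3

# Step 1: declare the "object" as a relational table that will hold
# the harvested data (names here are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE weather (forecastTime TEXT, outlook TEXT, tempC INTEGER)"
)

def run_robot(harvested):
    """Step 2/3: a stand-in for one scheduled robot run, loading the
    records the robot harvested from a web page into the object."""
    conn.executemany("INSERT INTO weather VALUES (?, ?, ?)", harvested)
    conn.commit()

# Pretend the robot just scraped one forecast box
run_robot([("Sunday 1800", "Clear Sky", 7)])
print(conn.execute("SELECT * FROM weather").fetchall())
```

From here a traditional ETL tool like SSIS can pick the data up from the table, exactly as the post suggests.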
In my example, I took a web page from the BBC website that shows the weather where I live:
I wanted to harvest the time, the basic outlook and the temperature from the first box in the “first 24 hours” section:
You can see that I’ve selected the tag containing the time in the central browser window, and that it will be mapped to the attribute weather.forecastTime.
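Kapow does this selection visually, by clicking on the tag in the browser window; under the covers, though, the mapping amounts to pulling named values out of the page's markup and into the object's attributes. A rough sketch of the same idea in code (the HTML fragment is invented, and the real BBC page's markup will differ):

```python
import re

# Made-up fragment standing in for the first box in the
# "first 24 hours" section of the forecast page
snippet = (
    '<div class="forecast">'
    '<span class="time">Sunday 1800</span>'
    '<span class="outlook">Clear Sky</span>'
    '<span class="temp">7&#176;C</span>'
    '</div>'
)

# Map page elements onto the weather object's attributes,
# e.g. the time tag onto weather.forecastTime
weather = {
    "forecastTime": re.search(r'class="time">([^<]+)', snippet).group(1),
    "outlook":      re.search(r'class="outlook">([^<]+)', snippet).group(1),
    "tempC":        int(re.search(r'class="temp">(\d+)', snippet).group(1)),
}
print(weather)
```

Doing this by hand with regular expressions is exactly the brittle work the visual tool saves you from; the point of the sketch is only to show what the mapping produces.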
I think the thing I find really exciting about this tool is that, when you think about it, there is a whole load of useful data out there on the web that would be great to use as part of your BI, if you can extract it quickly and easily – and indeed legally, because I suspect there might be some legal objections to doing this on a large scale. One of the examples Kapow gave me was of a customer that harvested comments on its products from places like Amazon, then fed that data through a text-mining application to monitor public opinion of its products (today's Dilbert shows how this information could be used!); I would have thought this is something a lot of companies would like to do. The weather forecasts I was looking at could also be useful for predicting retail sales in the short term, and if you're an online retailer you'd probably want to compare your prices for certain products with those of your competitors. I'm sure we'll be seeing much more use of web data in BI, captured using tools like this, in the future…