Archive for the ‘ETL’ Category
You may have already seen that the labs release of Data Explorer is now publicly available; there’s a whole load of really useful resources available on the learning page too if you’re interested in finding out more about it. I’ve been very lucky to have had early access to Data Explorer, and to test it out I put together a simple demo using the cloud service that shows off a typical use-case.
The first thing I did was to use Google Docs (just to have a cross-platform demo, not because I love Google in any way, honest…) to create a simple questionnaire using Google Forms. Before you read any further, please go and fill out the questionnaire I created here:
Don’t worry, there are only three questions and it’s all non-personal data! For those of you reading offline, here’s a screenshot:
Now when you create a questionnaire like this in Google Forms, the responses get put inside a Google Docs spreadsheet. Here’s the link to the spreadsheet behind my questionnaire:
The good thing about Google Docs (unlike, ahem, the Excel Web App) is that it has an API. The contents of this sheet can easily be exported in a number of formats, including csv, which means I could get the data into PowerPivot very easily. But there was a problem: the last question is multiple-choice, and for the results of that question I got a comma-delimited list of values in a single cell in the spreadsheet (see the screenshot above) – which was not going to be very useful for analysis purposes. What I really wanted was this data split out into separate columns, one column for each version, containing a boolean value to show whether that version had been checked; so if I was going to analyse my responses by version I clearly needed to do some ETL work. I could do this with a calculated column inside PowerPivot, of course, but then every time someone wanted to work with this data in a new PowerPivot model they’d have to repeat all the work, which is a pain – and clearly some users wouldn’t have the DAX skills to do it anyway. The best thing to do would be to perform the ETL somewhere up in the cloud, so everyone could benefit from it…
Enter Data Explorer. I created a simple mashup with the following steps:
- Imported the data from the Google spreadsheet as a csv file
- Cast that data as a table
- Split the Timestamp column into two separate Date and Time columns
- Added new columns to the table for each version of SSAS that contained the value True if that version had been checked in a particular response, False if not
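For anyone curious what those steps amount to outside Data Explorer, here’s a rough sketch of the same transformation in Python using the standard csv module. The sample rows and the list of versions are invented for illustration – only the Timestamp column and the versions-question column name come from the actual spreadsheet:

```python
import csv
import io

# Invented sample of the questionnaire responses, for illustration only;
# the real data comes from the Google spreadsheet's csv export.
raw = """Timestamp,What versions of Analysis Services have you used?
01/11/2011 10:15:00,"OLAP Services, Analysis Services 2000"
02/11/2011 09:30:00,Analysis Services 2008
"""

# One True/False column will be added per version (this list is an assumption)
versions = ["OLAP Services", "Analysis Services 2000", "Analysis Services 2008"]

rows = []
for row in csv.DictReader(io.StringIO(raw)):
    # Split the Timestamp column into separate Date and Time columns
    date, time = row["Timestamp"].split(" ", 1)
    answer = row["What versions of Analysis Services have you used?"]
    out = {"Date": date, "Time": time}
    for v in versions:
        # True if this version appears in the comma-delimited answer
        out["Used " + v] = v in answer
    rows.append(out)
```

The same contains-style check is what the Data Explorer expression below does for the “OLAP Services” column.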
Apart from the usual struggles that go with learning a new language, it was pretty straightforward and I was impressed with how easy it was to use. Here’s an example of an expression that adds a new column to show whether the respondent checked the “OLAP Services” box in the final question:
= Table.AddColumn(#"Rename Date Time", "Used OLAP Services", each if Text.Contains([#"What versions of Analysis Services have you used?"],"OLAP Services") then "True" else "False")
Finally, I published the output of the mashup publicly. This page contains all of the links to download the live data in various different formats:
If you filled in the questionnaire you should be able to find your responses in there because it’s a live feed.
And of course you can now import the data into PowerPivot very easily, for example by using the OData feed from Data Explorer. First, start Excel, go into the PowerPivot window and click on the “From Data Feeds” button:
Then, in the wizard, enter the URL of the OData feed:
And you should then have no problems importing the data:
…and then analysing the responses. To do that you’ll want to create a simple measure to count the number of responses, with a definition something like this:
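The definition itself was shown in a screenshot; a plain row-count measure along these lines would do the job (the table name Responses is my assumption – use whatever name the feed was imported under):

```
Number of Responses = COUNTROWS('Responses')
```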
I’m looking forward to seeing the data come flooding in!
This approach could easily be applied to scenarios such as analysing feedback from user group meetings or events, and with the number of online data sources out there, there must be hundreds of other potential applications as well. And given that anyone can now publish and sell data on the Windows Azure Marketplace, there must be ways of making money from this too…
I’ve just seen there’s a new release on the SQL Azure Labs site – Codename “Social Analytics” (that’s a bad codename by the way – did MS finally run out of place names in Washington state?). Full details are here:
It’s a service that allows you to collect social web data and then either use it for business analysis or automate responses; it has an OData endpoint so the obvious tool for doing the analysis is PowerPivot, and I guess it could also be a data source for Data Explorer when that comes along. At the moment it’s very limited in that you can only analyse the results of one of two feeds that aggregate data about either Bill Gates or Windows 8, which is a shame, but it should still be fun to play with.
One of the most interesting things I saw last week at the PASS Summit was Data Explorer, the cloud-based data transformation and publishing tool that was demoed in the keynote on day 1. While it was roundly dismissed as ‘yet more Excel’ by the disgruntled DBA faction I thought it showed some potential (you can see a walkthrough of what was shown here) – even if the fact that it was a SQL Azure Labs project suggested it was not destined to be a real product.
Today, however, I came across this post on Tim Mallalieu’s blog with a 10 minute video demo of an Excel addin version of Data Explorer, made earlier this year. Tim notes in his blog that:
We still have both the client and the cloud service but we only showed the cloud bits at PASS last week.
I would urge you to go and watch the video, because what’s shown is a very substantial, capable tool: an Excel addin for doing self-service ETL. Tellingly the name of the tool in the demo is “PowerImport” – and although Tim suggests in his blog that “some names and concepts have evolved quite a bit since March”, the choice of name speaks volumes. It looks like this could be to SSIS what PowerPivot is to SSAS, and a big selling point for Microsoft’s self-service BI story if it does get released.
I’ve just seen the news on the Windows HPC blog that Dryad and DryadLINQ are now in beta:
If you’re wondering what Dryad is, Daniel Moth has blogged a great collection of introductory links; Jamie and I have been following Dryad for some time too. Although it seems like its integration with SSIS has fallen by the wayside, its relevance for very large-scale ETL problems remains: it will allow you to crunch terabytes or petabytes of data in a highly parallel way. And given that you can now join Windows Azure nodes to your HPC cluster, it sounds like a solution that can scale out via the cloud, which makes it even more exciting. I hope the people working on Microsoft’s regular and cloud BI solutions are talking to the HPC/Dryad teams.
After the distraction of the last few days, here’s a quick post about what I saw on day three of the PASS Summit last week. In fact it wasn’t quite as exciting as the previous two days but I did find out a bit more about two products I was interested in: Impact Analysis & Lineage, and Data Quality Services.
In fact I didn’t find out much about Impact Analysis & Lineage – it didn’t get much exposure at all, for some reason; that’s a shame, because I think it’s a potentially very useful service. It allows you to see what impact any changes you make in one place in your BI solution will have elsewhere: for example, renaming a column in a dimension table could break SSIS packages, SSAS dimensions and SSRS reports, but today it’s pretty difficult to know what the impact of any change will be. Impact Analysis & Lineage, as far as I could gather, is a service that crawls all the files associated with each part of your BI solution and builds up a list of the dependencies between them to help solve this problem. I suspect it wasn’t demoed because it’s not finished yet.
We saw a lot more about Data Quality Services. DQS is a substantial new product that allows end users and BI developers to create rules about what data is valid in a given scenario, and then apply those rules to perform automated data cleansing; apparently it’s not a rebadged version of the Zoomix product that MS bought a while ago – it’s a lot more ambitious than that. Example scenarios include cleaning addresses by comparing them to a master address list, possibly sourced from the Azure Datamarket; using regular expressions to ensure that valid URLs and stock ticker symbols are stored in a table containing information about companies; and using fuzzy matching on names and addresses to find groups of customers who live together in the same household. Although some people I talked to were a bit put off by the bug-ridden demo, I was quite impressed by what I saw – a lot of thought seems to have gone into it and the UI looks good. I believe there’ll be an SSIS component that will allow you to apply your rules within a data flow too, but that wasn’t shown.
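As a rough illustration of the regular-expression style of rule mentioned above, here’s a sketch in Python – the patterns are my own assumptions for the sake of the example, not the actual rules used in the DQS demo:

```python
import re

# Illustrative validation rules of the kind described; both patterns are
# assumptions for this sketch, not what DQS actually ships with.
TICKER = re.compile(r"^[A-Z]{1,5}$")               # e.g. MSFT: 1-5 capital letters
URL = re.compile(r"^https?://[^\s/$.?#].[^\s]*$")  # loose sanity check on URL shape

def is_valid_ticker(s):
    """True if s looks like a simple US-style stock ticker symbol."""
    return bool(TICKER.match(s))

def is_valid_url(s):
    """True if s looks like an http/https URL."""
    return bool(URL.match(s))
```

In DQS itself you’d define rules like these through the UI rather than in code, of course.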