Archive for the ‘Cloud’ Category
Importing UK Weather Data from Azure Marketplace into PowerPivot
I don’t always agree with everything Rob Collie says, much as I respect him, but his recent post on the Windows Azure Marketplace (part of which used to be known as the Azure Datamarket) had me nodding my head. The WAM has been around for a while now and up until recently I didn’t find anything much there that I could use in my day job; I had the distinct feeling it was going to be yet another Microsoft white elephant. The appearance of the DateStream date dimension table (see here for more details) was for me a turning point, and a month ago I saw something really interesting: detailed weather data for the UK from the Met Office (the UK’s national weather service) is now available there too. OK, it’s not going to be very useful for anyone outside the UK, but the UK is my home market and for some of my customers the ability to do things like use weather forecasts to predict footfall in shops will be very useful. It’s exactly the kind of data that analysts want to find in a data market, and if the WAM guys can add other equally useful data sets they should soon reach the point where WAM is a regular destination for all PowerPivot users.
Importing this weather data into PowerPivot isn’t completely straightforward though – the data itself is quite complex. The Datamarket guys are working on some documentation for it but in the meantime I thought I’d blog about my experiences; I need to thank Max Uritsky and Ziv Kaspersky for helping me out on this.
The first step in the process of importing this data is to go to the Azure Marketplace and construct a query to get the slice of data that you want – this is a big dataset and you won’t want to get all of it. Once you’ve signed in, go to https://datamarket.azure.com/dataset/0f2cba12-e5cf-4c6d-83c9-83114d44387a, subscribe to the dataset and then click on the “Explore this Dataset” link:
This takes you to the query builder page, where you get to explore the data in the different tables that make up this dataset:
You choose the table you want to explore in the confusingly-named ‘Query’ dropdown box on the right-hand side of the screen. The available tables are:
- ThreeHourlyForecast, a fact tables containing three hourly weather forecasts
- ThreeHourlyForecastArchive, a fact table containing aggregated, averaged values for the various forecasts for a given date and time
- SignificantWeather, a dimension table containing the different types of weather that can be forecast
- Visibility, a dimension table containing the different levels of visibility
- DailyForecast, a fact table containing daily weather forecasts
- Observations, a fact table containing observed weather
- Site, a dimension table containing all the UK’s weather stations
As far as I can tell, this data is more or less the same as what’s available through the Met Office’s own DataPoint service, and the documentation for this is here: http://www.metoffice.gov.uk/public/ddc/datasets-documentation.html
Once you’ve selected a table you can construct a filter by entering values in the Optional Parameters boxes below the query dropdown. These changes are then reflected in the URL shown at the top of the screen:
This URL represents an OData query. One thing I didn’t notice initially is that the query that is generated here includes a top 100 filter in it which you’ll need to remove (by deleting &$top=100 from the end of the URL) if you want to do anything useful with the data; you might also want to build a more complex query than is supported by the query builder, and you can learn how to do this by reading this article.
In my case I decided to look at the full three hourly forecast data. As I said, this is a big dataset – initially I thought I’d download the whole thing, but 18 million rows and several hours later I cancelled the import into PowerPivot. Instead I opted to look at data for the whole of the UK for just one forecast made on one day, which worked out at a more manageable 250000 rows. What’s not clear from any of the current documentation is what all of the columns in the three hourly forecast fact table represent:
- Date is the date the forecast is issued
- StartTime is the time the forecast is issued and is either 0, 6, 12 or 18, representing midnight, 06:00, 12:00 and 18:00 – new forecasts are issued every 6 hours
- PredictionTime is the time that an incremental update to a forecast is issued; these incremental updates appear every hour. PredictionTime is an actual time value going from 0 to 23 representing the hour the incremental update was issued.
- TimeStep is an offset in hours from the StartTime, and represents the time that the forecast is predicting the weather for. It ranges in value from 0 to 120, going up in 3s (so the values go 0, 3, 6, 9… 120), meaning we have weather predictions for 5 days into the future for each forecast.
Therefore, for any given row in the ThreeHourlyForecast table, if the Date is April 10th 2012, StartTime is 6, PredictionTime is 8 and TimeStep is 9, then this is data from a forecast that was issued on April 10th 2012 at 8am (the second incremental update to the 6am forecast) and this row contains the prediction for the weather for the time StartTime+TimeStep = 6 + 9 = 15:00 on April 10th 2012.
Here’s the OData url I used to grab data for the three hourly forecast issued on April 10th at midnight (StartTime=0 and PredictionTime=0) for all weather stations and all time steps:
To use this URL in PowerPivot, you need to create a new PowerPivot workbook, open the PowerPivot window and then click the From Azure DataMarket button:
Then enter your query URL and Account Key (which you can find on the Query Explorer page by clicking on the Show link, as indicated in the screenshot above):
Having imported this data I also imported the whole of Site (renamed here to Weather Stations) and SignificantWeather tables to give the following PowerPivot model:
Here are the joins I used:
I also created a few calculated columns, including one called ActualPredictionForStart which added the TimeStep to the Start Time and the Date to get the actual date and time that the prediction is for:
=[Date] + (([StartTime] + [TimeStep])/24)
With this all done, I was able to find out what the predicted weather for the current time and my home town was in this (as of the time of writing) two-day old forecast:
…and do all the usual PivotTable-y and chart-y things you can do with data once it’s in Excel:
Incidentally, the forecast is wrong – it’s not raining outside right now!
PivotTables and Excel charts are all very well, but there’s a better way of visualising this data when it’s in Excel – and in my next post I’ll show you how…
UPDATE: First of all, I owe an apology to the Met Office – as soon as I hit publish on this post it started pouring with rain, so they were right after all. Secondly, in a weird co-incidence, Rob Collie posted about using the other weather dataset in the DataMarket on his blog: http://www.powerpivotpro.com/2012/04/download-10000-days-of-free-weather-data-for-almost-any-location-worldwide/
Bringing It All Together In The Cloud and Excel
A few things got released in the past week or so that I would normally have devoted a short blog post to; now I’ve finally made it onto Twitter I tend to just tweet about them instead, but I still think a blogging is the best way for me to get my thoughts together about what they actually mean. Let’s consider this ‘stuff’ in isolation first:
- http://www.excelmashup.com/ went live. It’s a JavaScript API for the Excel Web App which is useful, but as Jamie Thomson notes here it’s not the API he and I have been asking for for a long time, alas.
- The first CTP of Hadoop on Azure went live. Denny Lee’s blog post is probably the best place to start to learn about it; the video in the post is a very useful quick overview of what it does too. I noticed that it supports importing data from the Windows Azure Marketplace (what used to be called the Azure Datamarket).
- The Azure Marketplace also got a speed upgrade, as Boyan Penev notes here. This is important because whenever I’d tried to use it in the past its appalling performance had stopped me doing anything interesting with it at all. This, plus the fact that you can now publish your own data there, turns what was something of a white elephant into what could be an integral part of the Microsoft cloud BI platform.
- Version 2 of Google BigQuery went live, although it’s still in beta.
- The CTP of Data Explorer got released, which of course I’ve blogged about already here and which Jamie blogged about here and here.
- Microsoft announced Office 365 compliance with leading EU and US standards for data protection and security, which means less of those legal worries about whether you’re allowed to put all that interesting data you want to use for BI into the cloud.
From this it’s clear that Microsoft’s cloud BI platform is beginning to take shape, as are competing cloud BI platforms (if we assume that Google actually has a cloud BI strategy, and I think it has), and I think it’s fair to say Microsoft is well placed. There’s also yet more evidence, as if it was not blindingly obvious already, that Excel is at the heart of Microsoft’s BI strategy. Look at how data from Hadoop on Azure can be imported directly into Excel, and how this is flagged up as an important selling point, in the video on Denny’s blog post. However I think Microsoft needs to make even more of an effort to make everything Excel-like: now that it’s building a new BI platform from scratch it has a very rare opportunity to do this and to produce a truly coherent set of tools rather than the traditional grab-bag of technologies that make up the current Microsoft BI stack and which the cloud platform could also end up as too. Actually I’d like to go further and say that rather than have a bunch of separate cloud BI tools MS should make everything BI a feature of Excel, whether it be Excel on the desktop or Excel in the cloud. This might seem a controversial thing to say, but if MS is committed to true mass-market, self-service BI then Excel has to be the platform and MS needs to base its cloud BI strategy on it 100%.
Here are a couple of the things I’d like to see happen to achieve this:
- Data Explorer’s expression language needs to be made consistent with Excel syntax, in the way that DAX is consistent with Excel syntax. While I like what Data Explorer can do at the moment I hate having to learn a new syntax and a new set of functions for each tool I use (think SSRS expressions, SSIS expressions, MDX and so on) and it makes me much less productive when I do. I want to use the same syntax to write formulas in my spreadsheet, calculations in DAX and calculations/transformations in Data Explorer too – after all, we want to do a lot of the same things in ETL and reporting like lookups, aggregation and pivoting.
- Hadoop on Azure is an important tool to have, not least because Hadoop is already so widely used in the wider world so it adds some credibility to the MS BI stack, but I’d like to be able to control massively parallel calculations from inside Excel and not just consume the results of them there. I’m thinking something like DataScope or the way you can scale out Excel calculations on Windows HPC, maybe driven though a PowerPivot-like interface with calculations expressed in DAX, or Data Explorer (which links back to my last point, because if Data Explorer expressions were DAX it needn’t be an either/or choice). It could of course still be Hadoop at the back-end with a layer on top to make it consistent with Excel.
- We desperately need a proper API for the Excel Web App (yes, here we go again…). An OData API for importing and exporting data from spreadsheets on the web is only the first step; in the long term I’d want the Excel Web App to have capabilities like Project Dirigible, so that Excel on the desktop could become a kind of client tool for a massively scalable Excel Server in the cloud (note that I didn’t say Excel Services, which is more like Excel-on-the-server than a real Excel Server in my opinion). I’d want to be able to store data in the cloud and automatically synchronise it with Excel on multiple desktops or elsewhere in the cloud. I’d also want to create Excel spreadsheets in the cloud that acted like functions (similar to Project Dirigible), where these functions could be called from other Excel spreadsheets which again could be on the desktop or be in the cloud.
- We need Analysis Services in the cloud. We’ve known it’s been coming for a long time, but not what form it will take. Again, I’d like to see tight integration with Excel similar to how PowerPivot works on the client or tighter, where Analysis Services would not be a separate service but just be the functionality within Excel for manipulating tables of data for reporting purposes. I’d want Data Explorer to be able to create and load these tables.
- We also need the full BI functionality of desktop Excel – pivot tables and cube formulas – in the Excel Web App. I would guess this is in the pipeline anyway since it’s a pretty obvious requirement if Microsoft’s cloud BI strategy is going to work. I also don’t see the point of Power View being a separate app – I’d like to see it become a data visualisation feature of Excel.
- Finally, I’d like to see some way of tying all the data held in this gigantic data store together. You’d want to be able to search it, certainly, but also understand where it’s come from and what it actually represents. It sounds like maybe this is what Project Barcelona is aiming for…
Coordinating the work of multiple teams at Microsoft in the way this would demand is a particularly thankless task, I know. But maybe, just maybe, someone senior (maybe a Technical Fellow like Amir…?) could pull off something this ambitious?
Microsoft Codename “Social Analytics”
I’ve just seen there’s a new release on the SQL Azure Labs site – Codename “Social Analytics” (that’s a bad codename by the way – did MS finally run out of place names in Washington state?). Full details are here:
http://www.microsoft.com/en-us/sqlazurelabs/labs/socialanalytics.aspx
It’s a service that allows you to collect social web data and then either use it for business analysis or automate responses; it has an OData endpoint so the obvious tool for doing the analysis is PowerPivot, and I guess it could also be a data source for Data Explorer when that comes along. At the moment it’s very limited in that you can only analyse the results of one of two feeds that aggregate data about either Bill Gates or Windows 8, which is a shame, but it should still be fun to play with.
Excel DataScope
Jamie Thomson just tipped me off about something new and very interesting – one week after I had a moan about Microsoft doing nothing about Excel and the cloud, here comes Excel DataScope:
http://research.microsoft.com/en-us/projects/azure/datascope.aspx
Here’s the blurb from the site:
From the familiar interface of Microsoft Excel, Excel DataScope enables researchers to accelerate data-driven decision making. It offers data analytics, machine learning, and information visualization by using Windows Azure for data and compute-intensive tasks. Its powerful analysis techniques are applicable to any type of data, ranging from web analytics to survey, environmental, or social data.
There are yet more tantalising details in the video and the two pdfs here:
http://research.microsoft.com/apps/video/?id=149888
http://research.microsoft.com/en-us/projects/azure/exceldatascope_11_poster1.pdf
http://research.microsoft.com/en-us/projects/azure/exceldatascope_11_poster2.pdf
I’m currently trying to find out more about all this, but there’s clearly a ton of cool stuff here:
- You can use the Live Labs Pivot control for visualisation.
- It does data mining in the cloud. Is this the successor to the old data mining addin? The functionality is clearly the same.
- There’s a section on Map Reduce running on Windows Azure on one of the posters. Is this using Dryad?
Is this a first glimpse of a new cloud-based BI platform from Microsoft? Will SSAS in the cloud form part of it? Before we all get too excited (or at least I get too excited) it’s worth noting that this is coming from the eXtreme Computing Group and not the SQL Server team, it’s clearly aimed at scientific rather than business users, and is described as “an ongoing research and development project”, ie it is not a commercial product. The potential is obvious though, and I hope it becomes something significant.
The problem of power users, DAX and difficult calculations
Ever since PowerPivot got released, one of the questions I’ve heard debated over and over is whether it’s easy for non-IT users to learn and use DAX or not. The stock answer from Microsoft, and I agree with them here, is that anyone with basic Excel knowledge can do simple calculations in DAX, but the more complex calculations (for example, those which need to use the CALCULATE() function) are probably only ever going to be written and understood by BI professionals. Even then there’s plenty of evidence that even BI and Excel pros struggle to get to drips with DAX, as blog posts like this and this suggest. This is certainly good news for me professionally – my feeling is that for more complex calculations, DAX is just as difficult as MDX although conceptually very different, and I’ve made a good living out of MDX over the years – but on the other hand I can’t help but feel this represents a capitulation on the part of Microsoft. After all, isn’t the target PowerPivot user someone like an accountant, actuary or statistician, someone who is mathematically literate and capable of creating amazing complicated spreadsheets? How can self-service BI be truly self-service if certain calculations are still too difficult for anyone but IT professionals?
The problem isn’t the design of DAX as such, I think, but the fact that DAX exists as a language. I’m sure most BI professionals have seen worked examples of calculations created in Excel that we have been expected to translate into SQL/MDX/DAX/whatever tool we’re using. It’s the act of translating the calculation into an expression that’s the difficult thing: end-users understand how the calculations work, but they can’t speak the languages that BI tools use whereas IT people can. So why can’t our BI tools allow users to express calculations in a way that users can understand – as a series of spreadsheet formulas?
One example of a tool that tries to do this already is the Intelligencia OLAP control. However, what prompted these thoughts was Project Dirigible, a new web-based spreadsheet from the people who produce Resolver One. The key feature that sets it apart from regular spreadsheets-in-a-browser is that it’s designed for scenarios where users want to scale out complex calculations over a large number of CPUs in the cloud. What makes this possible is the run_worksheet() function, documented here, that allows you to write a calculation that overwrites the values in cells on another sheet, recalculates that sheet, and then retrieves values from cells in that sheet – as this blog post explains, it allows you to use a spreadsheet as a function from within another spreadsheet. I think this is a deeply cool idea.
Here’s an extremely simple example of how it works. The following sheet:
http://www.projectdirigible.com/user/cwebb/sheet/1254/
Contains an example of a tax calculation, with three cells: the value before tax, the tax rate, and the value after tax.
This second sheet then contains a number of sales records and calls the first sheet to calculate the value for each of the sales after tax:
http://www.projectdirigible.com/user/cwebb/sheet/1264/
Here’s the function call I’ve used to calculate tax:
=run_worksheet("http://www.projectdirigible.com/user/cwebb/sheet/1254/", {(2,3) -> c3}).d3.value
Apart from the parallelism, there are a number of obviously good things that follow on from breaking calculations out into separate sheets like this: the ability to reuse this calculation across multiple spreadsheets; the ability to completely change how this calculation works at some later date so long as the parameters remain the same; and the fact that how it works is easily comprehensible to non-IT users because it’s expressed in a very visual way. I think this general approach could be applied to DAX: in DAX, most advanced calculations involve manipulating tables of data, filtering them and aggregating values in different columns. We have tables in Excel, these tables can be filtered, we can create calculated fields inside these tables and we can aggregate and perform other calculations on the data in these tables elsewhere in the sheet, so everything we can do in a DAX calculation can already be represented in the form of a spreadsheet. So if it was possible for the Vertipaq engine to understand calculations expressed in spreadsheet form instead of as DAX expressions then non-IT users would have far fewer limits on the kind of calculations they could write. Of course, this is a very big ‘if’ and there’s only a slim chance that something like this might happen – it would require a lot of development effort and probably that PowerPivot became a native feature of Excel.
Sadly, PowerPivot apart, I’m not sure I see much evidence of exciting new ideas in the world of Excel; indeed it seems like Excel has barely got to grips with the idea of the internet, let alone a truly cloud-based model. Why can’t all the interesting parallelisation stuff that’s possible with Windows HPC Server 2008 R2 be translated to the cloud and be made available to all Excel users? Why doesn’t the Excel Web App have any kind of API yet? I’m sure all the usual responses about lack of time and resources can be reused here, but I’d hate to see Excel become yet another IE6, neglected by MS while the competition innovates – and after all, isn’t Office one of Microsoft’s cash-cows, and so shouldn’t it be getting the kind of time and money lavished on stuff like Windows Phone and Bing…?
SQL Azure Reporting: Reporting Services in the Cloud
So SSRS in the cloud has just been announced! See this post on the SQL Azure team blog:
http://blogs.msdn.com/b/sqlazure/archive/2010/10/28/10082293.aspx
…and also this 20 minute video from PDC giving a lot more detail:
http://player.microsoftpdc.com/Session/5007e9c3-03cd-41b4-9e1c-4eb17cd60e37
Basically it’s the SSRS you know and love with only a few limitations: for example it only supports SQL Azure as a data source and there’s none of the developer extensibility options (like custom data extensions) available yet.
I can’t wait for SSAS in the cloud…
Hadoop on Azure?
Here’s something interesting I’ve just seen on James Dixon’s blog: apparently Microsoft is preparing to provide Hadoop on Windows Azure. Here’s the article James links to:
http://www.sdtimes.com/link/34319
I wonder if this is just a stopgap, in response to customer demand, as the article suggests it might be? How does this fit with the recently-announced Technical Computing Initiative? Was Project Dryad a dead end?
SQL Azure BI – now hiring
Hmmm, once again Microsoft’s job ads give away a few details on upcoming products. The SQL Azure team mentioned on their blog they were hiring, so I had a look what jobs were on offer. And lo-and-behold, there were a few jobs there for developers and testers for an Azure BI team. Here are two examples:
https://careers.microsoft.com/JobDetails.aspx?ss=&pg=0&so=&rw=1&jid=14197&jlang=EN
https://careers.microsoft.com/JobDetails.aspx?ss=&pg=0&so=&rw=2&jid=14206&jlang=EN
I quote from the first:
The SQL Azure team leads the expansion of the existing SQL Server business to the Cloud. As part of Microsoft’s comprehensive Azure Services offering we entered commercial availability earlier this year with a relational database service. We are extending the platform to bring Microsoft’s market-leading Business Intelligence services as part of a self-service BI solution.
and
Prior experience with SQL Server Reporting Services (SSRS) and SQL Server Analysis Services (SSAS) or other business intelligence technologies is desired
To be fair, we already knew that SSAS and SSRS in the cloud were coming at some point – it was on several publicly-available slide decks. Also, with PowerPivot being flavour-of-the-month and the cloud being the cloud, it’s hardly surprising that it’s a self-service BI solution. I dare say Dallas and OData fit in there somewhere too (incidentally, while we’re on the subject of OData, check out the Sesame Data Browser). The fact that all these ads have gone up recently and some of them talk about the ‘new’ BI team suggests development hasn’t started yet though, so I guess we won’t be seeing PowerPivot-in-the-cloud for another year or two; perhaps in time for Denali?
Tableau Public
In a clever marketing move, Tableau have just released a new free version of their tool called Tableau Public. I’ve been a fan of Tableau since I saw it a while back (was it really 2005?) but never seen it actually used at one of my customers, I suppose because of its hefty price. Anyway, what Tableau have done is created a basic version which can only connect to Access, Excel and Text files and can only save to Tableau’s own web gallery; however, it’s then possible to embed the visualisations you create in your own website/blog etc for sharing with the world. The stated aim is for this site to be the YouTube of data visualisation; coming from Tableau this is a half-credible claim because they really do understand this area and have some great tools. Also, I’m sure the extra publicity it will generate for the company will do no harm for sales of the paid version either.
And I’m sure it will be great when they get over their teething difficulties, because at the time of writing I can’t actually save anything… maybe they underestimated the amount of interest this would generate?
OK, it’s working now. But of course Windows Live Spaces doesn’t allow me to embed a ‘viz’ in a blog post (grr); instead, here’s a simple example I just uploaded using stats on the last 30 days of traffic on the front page of this blog:
http://public.tableausoftware.com/views/Blogstats/Sheet1
Here’s what it looks like in the client:
So Tuesday is the best day for page loads, but Thursday is slightly better for unique visitors… and so on. But it’s a cool tool and definitely worth checking out.
Pinpoint and Dallas
Interesting news from PDC: Microsoft has announced two new services – Pinpoint and Dallas.
You can find Pinpoint here: http://pinpoint.microsoft.com
Here’s the blurb from the site:
Pinpoint is the fast, easy way for business customers to find experts, applications, and professional services to meet their specific business needs—and build on the software they already have.
At the same time, Pinpoint helps developers and technology service providers quickly and easily get software applications and professional services to market—and engage customers who need what they offer.
Pinpoint is the largest directory of qualified IT companies and their software solutions built on Microsoft technologies.
- More than 7,000 software application offerings.
- More than 30,000 Microsoft-technology experts.
- The largest, most diverse set of Microsoft business platform offerings in the industry in a central location.
- Direct links between applications and the services that support them.
Whether you’re searching for expert help or offering it, Pinpoint helps you easily find and engage the right people and technologies to get the job done.
Much, much more interesting from a BI point of view is Dallas, which is part of Pinpoint: http://pinpoint.microsoft.com/en-US/Dallas
It’s Microsoft’s marketplace for data, all built on Azure. Again from the blurb:
Microsoft Codename “Dallas” is Microsoft’s Information Services business, enabling developers and information workers to instantly find, purchase, and manage Web services and datasets to power the next set of killer applications on any platform.
The Register has the best write-up of what this is here: http://www.theregister.co.uk/2009/11/17/microsoft_dallas_data_service/
From that article:
Dave Campbell, a Microsoft technical fellow, demonstrated Dallas at PDC. He showed a list of data provides from the partners such as infoUSA, subscriptions, the ability to store structured and unstructured data, and to explore the data without needing to parse it, to preview the data in ATOM, invoke the data as a Rest service and analyze the data using PowerPivot in Microsoft’s Excel spreadsheet program.
Note my emphasis on the last sentence! Here at last is the ability to buy that third party data that’s been a part of every Powerpivot demo. I’ve worked with a lot of companies that sell data in my career, and this looks like it could be a very significant development for them. I’d even heard vague rumours that MS were interested in buying commercial data providers at one point, several years ago – if they were prepared to go this extreme then it would certainly go a long way to making this strategy a success.
Now just think how cool it would be if SSAS or PowerPivot could be hosted on the cloud, so all you needed was Excel to analyse this data. Maybe one day…
