Archive for the ‘Data Mining’ Category
For my last post from the PASS Summit, I thought I’d mention briefly some of the products that caught my eye as I wandered round the exhibition hall this afternoon:
- OData Connectors from RSSBus (http://www.rssbus.com/odata/), a series of web apps that expose OData feeds (which then of course can be consumed in PowerPivot and SSAS Tabular) from a variety of data sources including Quickbooks, Twitter and MS CRM. I’d seen the website a month or so ago, actually, but I found out today they are close to releasing OData connectors for Google, Google Docs, Facebook, Email and PowerShell as well, which open up some intriguing possibilities for PowerPivot analysis. I can imagine doing a really cool demo where I set up an email address, got the audience to email me, then hooked PowerPivot up to my inbox and analysed the emails as they came in!
- XLCubed (http://www.xlcubed.com/) – well, ok, they aren’t exactly new to me but it was good to have a chat with the guys on the stand. It’s worth pointing out they have a good mobile BI story for SSAS users.
- Kepion (http://www.kepion.com/) – I was quite impressed with the demos I saw of their products for building SSAS-based BI solutions, especially for (but not restricted to) financial planning; it looked pretty slick.
- Predixion (http://www.predixionsoftware.com/predixion/) – again, the company itself isn’t new to me but I got a demo of their new product, Predixion Enterprise Insight Developer Edition, which I’d been meaning to check out for a while. This is an immensely powerful free tool for doing data mining in Excel and it’s very closely integrated with PowerPivot too. Even if you don’t want to do complex stuff, it has some features that would be useful for regular PowerPivot users such as the ability to select a column in a PowerPivot table, analyse the data in it and then generate bandings which are then persisted in a new calculated column.
Jamie Thomson just tipped me off about something new and very interesting – one week after I had a moan about Microsoft doing nothing about Excel and the cloud, here comes Excel DataScope:
Here’s the blurb from the site:
From the familiar interface of Microsoft Excel, Excel DataScope enables researchers to accelerate data-driven decision making. It offers data analytics, machine learning, and information visualization by using Windows Azure for data and compute-intensive tasks. Its powerful analysis techniques are applicable to any type of data, ranging from web analytics to survey, environmental, or social data.
There are yet more tantalising details in the video and the two pdfs here:
I’m currently trying to find out more about all this, but there’s clearly a ton of cool stuff here:
- You can use the Live Labs Pivot control for visualisation.
- It does data mining in the cloud. Is this the successor to the old data mining addin? The functionality is clearly the same.
- There’s a section on Map Reduce running on Windows Azure on one of the posters. Is this using Dryad?
Is this a first glimpse of a new cloud-based BI platform from Microsoft? Will SSAS in the cloud form part of it? Before we all get too excited (or at least I get too excited) it’s worth noting that this is coming from the eXtreme Computing Group and not the SQL Server team, it’s clearly aimed at scientific rather than business users, and is described as “an ongoing research and development project”, ie it is not a commercial product. The potential is obvious though, and I hope it becomes something significant.
The Microsoft Research website is always worth a look when you’re bored – there’s some interesting stuff going on there, and occasionally one of the projects there turns into a real product. I saw on Mary Jo Foley’s site a mention of something called CamGraph:
I suspect she’s got the wrong end of the stick when she says it’s a natural user interface project, but that’s beside the point. According to the website it is:
a distributed, scalable infrastructure that supports inference in large-scale probabilistic graphical models.
Basically a data mining app that can scale out across multiple servers; with a bit of imagination you could see how this could turn into data mining in the cloud, but we’re clearly a long way away from that happening.
As I’ve said before, I’m involved with the organisation of the SQLBits conferences here in the UK and at the moment the SQLBits committee is busy preparing for SQLBits 8 in April (make sure you come – it’s going to be great!). This eats up a lot of my spare time – spare time that I usually spend blogging – so I thought I’d kill two birds with one stone and blog about some of the BI-related stuff I’m doing for SQLBits (I’ve done this before but there’s plenty more mileage in this subject). It turns out a lot of the things SQLBits needs to do require classic ‘self-service BI’: solve a business problem as best you can with whatever data and tools are to hand. It’s good to see things from the end user’s point of view for a change!
First of all, let’s take a look at scheduling: how can we make sure that we don’t run two sessions in the same time slot that are interesting to the same type of attendee? If attendees are put in a situation where they are forced to choose between two sessions they want to see, they won’t be happy – we want to be able to create a schedule with as few difficult choices as possible. Unfortunately we don’t collect data about which sessions attendees actually go to, and even if we did it would be no use because of course by the time the session runs it’s too late to fix the agenda. However, well before the conference we allow people to vote for the ten sessions they’d most like to see out of all those that have been submitted (voting has just opened for SQLBits 8, incidentally), and we use this data to help us decide which ones make it onto the agenda; we can therefore use this data to help avoid overlaps.
This data can be visualised very effectively using NodeXL. To do this, I ran a SQL query on the SQLBits database that gave me every combination of two sessions that had been picked by the same user, so for example if a user had selected sessions A, B and C my query returned the pairs A-B, A-C and B-C. This gave me my list of edges for the graph and for the size of the edges I used the number of times the combination of sessions occurred, so I could see the most popular combinations. Unfortunately with 107 sessions on the list and thousands of edges, I got something that looked like one of my four-year-old daughter’s scribbles rather than a useful visualisation, so I decided to filter the data and look at one session at a time. Here’s what I got for my session ‘Implementing Common Business Calculations in DAX’:
Still not great, but at least with the thicker lines you can see where the strongest relationships are and when you select these relationships it highlights them and the nodes on either end, so you can read the names of the sessions. I then realised you could use the ‘dynamic filters’ functionality to filter out the weaker relationships, making it even easier to pick out the strongest ones:
So we can now see that the strongest relationships were with the sessions “You can create UK maps with SSRS 2008 R2” and “Data Mining with SQL Server 2008”. I’m still getting to grips with NodeXL, which I have to say I like more and more; it deserves more visibility in the MS BI world.
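Incidentally, the edge list itself comes from a simple self-join on the voting data. Here’s a sketch of the kind of query I mean – the table and column names (SessionVotes, UserID, SessionID) are illustrative guesses, not the real SQLBits schema:

```sql
-- Every pair of sessions picked by the same voter, with the number of
-- times each pair occurs (the edge weight for the NodeXL graph).
-- Table and column names are illustrative, not the real SQLBits schema.
SELECT
    v1.SessionID AS Session1,
    v2.SessionID AS Session2,
    COUNT(*)     AS EdgeWeight
FROM SessionVotes v1
INNER JOIN SessionVotes v2
    ON  v1.UserID = v2.UserID
    AND v1.SessionID < v2.SessionID  -- avoids self-pairs and duplicates like B-A
GROUP BY v1.SessionID, v2.SessionID
ORDER BY EdgeWeight DESC;
```

The `v1.SessionID < v2.SessionID` condition is what turns A, B, C into the pairs A-B, A-C and B-C exactly once each.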
Anyway, since this is a basket analysis problem I also thought of using the Data Mining Addin for Excel, but since I have Office 2010 64-bit I couldn’t. Luckily though the nice people at Predixion do have a version of their addin that works on 64-bit, and they gave me another eval license to use on my data. Getting useful results out of Predixion turned out to be ridiculously easy: I just copied the raw data into Excel, clicked the ‘Shopping Basket Analysis’ button on the ribbon and it spat out a pair of nicely-formatted reports. The first shows ‘Shopping Basket Recommendations’, ie if you select one session it recommends another one you might like:
And the second shows the most commonly-occurring ‘bundles’ of sessions that were picked together:
It almost feels too easy… but I think you can see that the results look correct and to be honest it’s much easier to do something useful with this than the NodeXL graph. When we close the voting for SQLBits 8 I’ll repeat the exercise and hand the results over to Allan, who’s in charge of speakers, and he’ll be able to use them to put together our agenda for Saturday April 9th.
As you’re probably aware, in my spare time I’m one of the people who help organise the SQLBits SQL Server conference in the UK. For the last few conferences I’ve worked on marketing, which basically means writing all those SQLBits emails that get sent out every week and thinking of ways to get ever more people to come along to the conference. Given that we handle all our registrations via RegOnline and can download all of the data we capture during the registration process I thought I ought to practice what I preach and do something BI-ish with it.
The first thing I did, of course, was to load all the data into PowerPivot. That worked well but, let’s be honest, you’ve already read 500 blog posts about loading data into PowerPivot and don’t want to read another one. Then I remembered the emails I’d been getting from Predixion, the data mining startup founded by amongst others Jamie MacLennan (who used to be on the SSAS data mining dev team), about their beta and thought I’d see what I could do with it.
Predixion have got a number of videos on their site to help you get started:
… and Kaspar has already blogged about what he’s done with Predixion twice (here and here), as has Mark Tabladillo, so I won’t repeat any of what they say and will just concentrate on my results; suffice to say that anyone who used the free SSAS data mining addin will feel very at home using Predixion Insight – in many respects it’s a much improved version 2.0 of that tool.
The most useful results I got were using the ‘Detect Categories’ tool on a table containing all my registration data to identify groups of attendees with similar profiles. For obvious reasons I can’t show you the raw data or anything derived from it that might identify individual attendees, but here’s a screenshot of the Excel report that got generated for me:
This report also shows data on the first cluster or category that the tool identified, and which I named ‘Hardcore SQLBits Fans’. You can see that these people are extremely likely to have registered very early: we opened registrations at the end of July; the ‘RegDate’ column shows the date they registered (DaysToConference shows the number of days from RegDate to the conference date); and this group was highly likely to have registered within two weeks of registration opening. Similarly they are very likely to be attending the full conference and not just individual days, and as a result are spending a lot of money with us. We’ve always known there’s a core group of people who come to SQLBits regardless of where or when it is, and this is that group – they’re our biggest fans and therefore we’ve got a duty to keep them happy!
Here’s another group, equally important for us, which I named ‘European Big Spenders’:
As you can see, the people in this group also spend a lot of money with us (perhaps more than the hardcore fans); registered slightly later, in mid-August; and are coming from outside the UK. We made a special effort to attract more attendees from Europe this time so this is all good to see, but interestingly this group is very likely not to be using a discount code to get money off the price of the paid parts of the conference. I spent hours setting up discount codes for various European user groups and it doesn’t look like this strategy has paid off; maybe the fact that these people are coming is totally unconnected to any of the efforts we made?
Last of all, let’s look at the ‘Delayed Friday’ group:
The people here are only coming for the Friday and not attending the training day; consequently they’re spending less money overall. They registered in mid-to-late August and the discount codes used (which I’ve had to black out, sorry) suggest they heard about us from user groups, emails that our sponsors sent out to their customers, or places like SQL Server Central. They’re not such rabid fans of SQLBits that they would come automatically, but they’ve been successfully marketed to, seen what we’ve got to offer and registered. These are the attendees we’ve had to work hard to attract, and so the ones we need to understand best to ensure they turn into the hardcore SQLBits fans of the future!
So, as you can see, with one click of a button Predixion Insight came up with loads of interesting, useful information for me – information I probably wouldn’t have found if I was browsing through the data myself using SSAS or PowerPivot. And this is only a taste of what it can do; I didn’t even try out much of the more advanced functionality. I can imagine Predixion’s cloud-based data mining offering is going to generate a lot of interest, and maybe after many years the time has come for data mining for the masses?
Here’s something interesting for anyone into cloud-based data mining: Google have announced a new prediction API. More details here:
Here’s the summary from the site:
The Prediction API enables access to Google’s machine learning algorithms to analyze your historic data and predict likely future outcomes. Upload your data to Google Storage for Developers, then use the Prediction API to make real-time decisions in your applications. The Prediction API implements supervised learning algorithms as a RESTful web service to let you leverage patterns in your data, providing more relevant information to your users. Run your predictions on Google’s infrastructure and scale effortlessly as your data grows in size and complexity.
Looks exciting! Now, why didn’t Microsoft do this? Clearly the ideas were there (remember the cloud version of the Excel DM addin?), but it seems like all the key members of the SQL data mining team had to leave MS to pursue their dreams: http://predixionsoftware.com/
It’s ridiculous, really, that OLAP and data mining functionality have co-existed inside Analysis Services for years now yet they never seem to be used together. I only ever work with the OLAP side of things and just play around with data mining – I’ve never done a serious project with it, unfortunately – and from what I can see the people out there working with Analysis Services data mining don’t tend to use it with cubes; it’s two separate worlds. Anyway, to get to the point, while I was preparing for SQLBits a few weeks ago I came up with a simple idea of how you could use AS data mining to generate forecast data which could then be loaded into a cube, an approach that I think could be easily adapted for use in most real production systems. Now, having spent an evening in a Munich hotel room testing this out, I present my findings.
The data set I’m using here was compiled by Tony Rogerson and put into a dimensional model by Allan Mitchell, and it contains the details of postings to SQL Server newsgroups over the past 8 years. For the purposes of this example I aggregated the data so it contained a single fact table with one measure, the number of postings to all newsgroups per day, which joined to just one Time dimension. It’s interesting data, in fact. Looking at the number of postings per year at a high level, from 2002 to 2007, you can see that newsgroup postings hit a peak in 2004 and then went into decline:
This is probably a result of the introduction of the MSDN Forums and their growth at the expense of newsgroups. Looking at the day level, you can see an obvious pattern where the number of postings is much higher during the week than at weekends:
The first thing I did was build a simple cube from this data, with a single sum measure representing the number of postings made and a simple Time dimension with Years, Quarters, Months and Dates. This allowed me to explore the data and create the above graphs in Excel. However I only had data up to April 29th 2008, and what I wanted to see was a forecast in my cube of postings for the rest of 2008 and 2009. Here’s what I did to get that:
- I created a new dimension in my cube called Scenario. It was built from a table containing the following two rows:
1    Actual
2    Forecast
This dimension then had a single attribute hierarchy with two members on it, Actual and Forecast.
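In T-SQL the dimension table amounts to something like the following – the table and column names here are illustrative, not necessarily what I actually used:

```sql
-- A minimal Scenario dimension table with its two rows.
-- Table and column names are illustrative.
CREATE TABLE DimScenario (
    ScenarioKey  INT         NOT NULL PRIMARY KEY,
    ScenarioName VARCHAR(20) NOT NULL
);

INSERT INTO DimScenario (ScenarioKey, ScenarioName) VALUES (1, 'Actual');
INSERT INTO DimScenario (ScenarioKey, ScenarioName) VALUES (2, 'Forecast');
```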
- I then added a new foreign key column to my existing fact table so I could join the new Scenario dimension to it. This column always contained the value 1 because all the data in my fact table only represented Actual values.
- Next I created a new data mining model using the Microsoft Time Series algorithm based on my fact table. The Time dimension foreign key column was used as my Key Time column and the measure representing the number of postings was set to be an Input as well as Predictable.
Now I’m certainly not a data mining expert and I didn’t spend any time tweaking the model, but it seemed to give reasonably good results at the day level picking up the weekday/weekend variation already noted above:
Obviously if you were doing this for real you’d want to spend a bit more time making sure your mining model was giving you good results.
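In DMX terms the model definition boils down to something like this – in practice I built it in BIDS rather than writing DMX, and the object names here are mine, not my actual project’s:

```sql
-- A Time Series mining model over the fact table: the Time dimension
-- foreign key is the KEY TIME column and the posting count is both an
-- input and predictable (PREDICT). Object names are illustrative.
CREATE MINING MODEL [Postings Forecast Model] (
    [Date Key]      LONG KEY TIME,
    [Posting Count] LONG CONTINUOUS PREDICT
)
USING Microsoft_Time_Series
```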
- With my mining model processed, I was then able to use the following DMX query to give me a flattened resultset showing the predicted number of postings per day from April 30th 2008 until the end of 2009:
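A sketch of what such a query looks like – the model and column names are assumed, not my actual object names, and 611 is the number of days from April 30th 2008 to the end of 2009:

```sql
-- Predict the next 611 days of postings; FLATTENED turns the nested
-- time series result into a tabular resultset with a $TIME column
-- (the predicted time key) and the predicted value.
-- Model and column names are illustrative.
SELECT FLATTENED
    PredictTimeSeries([Posting Count], 611)
FROM [Postings Forecast Model]
```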
One thing I found was that, to get this to work nicely, I had to change the ‘meaningful’ surrogate keys on the Time dimension that Allan had provided me with (eg the key for April 30th 2008 was 20080430, and for May 1st 2008 it was 20080501) to meaningless integer surrogate keys where each new day had a key value one greater than the previous day’s (eg April 30th 2008 was key 3043, May 1st was key 3044, and so on). This meant that when I ran the above DMX query it returned surrogate key values for each predicted value that I could use with my Time dimension; before I did this, the DMX query just added 1 to the date key for each predicted date, giving me incorrect key values like 20080430, 20080431 and 20080432 for April 30th, May 1st and May 2nd.
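Generating the replacement contiguous keys is easy enough in T-SQL with ROW_NUMBER; a sketch, assuming a hypothetical DimDate table:

```sql
-- Replace 'meaningful' YYYYMMDD surrogate keys with contiguous integer
-- keys, so that key + 1 always means 'the next day'.
-- Table and column names are illustrative.
SELECT
    DateKey                                   AS OldDateKey,  -- eg 20080430
    ROW_NUMBER() OVER (ORDER BY CalendarDate) AS NewDateKey   -- eg 3043
FROM DimDate;
```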
- I then took the results of this query and used SSIS to load them into a second fact table in my relational data source. The SSIS data flow task looked like this:
Here’s what I did at each step:
- Used an OLEDB data source, connected up to Analysis Services, to run the initial DMX prediction query
- Added a derived column with the value 2, as the foreign key for my Scenario dimension, to show that this was all Forecast data.
- Used a data conversion transform to convert the two columns returned by the DMX query from bigints to ints.
- Used an OLEDB destination to load the data into a new Forecast fact table (with a structure identical to the existing fact table) in SQL Server.
And of course if I didn’t include the Scenario dimension in my query I could look at Actual and Forecast values aggregated up to say the Month level:
Some thoughts on how you’d implement this in a real production system:
- You’re not going to want to (or probably be able to) use the Time Series algorithm at the granularity of your fact table. It might make sense to use it at Day granularity, but you’d probably want to aggregate your other dimensions up to a much higher level (for example aggregate from Product to Product Category, Customer to Country and so on) and then generate the forecasts. This would mean in turn you couldn’t just create a new partition in your existing measure group to hold the forecast data, you’d need to create a new measure group. However with some simple MDX you’d be able to integrate the two sets of values and get a similar result to the one I’ve achieved here.
- You’d want to automate the training of the data mining model, and this can be done pretty easily in SSIS using the Analysis Services processing task.
- You might want to store the set of forecasts you generate each day, so you could compare different forecasts made on different days or compare an old forecast with the actual data that subsequently came in. To do this you’d need to add a new dimension to your Forecast measure group which would be Forecast Date (ie the date the forecast data was generated), and maybe make your forecast measures semi-additive (eg last nonempty) with regard to this new Forecast Date dimension.
- As I said, you’d want to spend a lot more time than I did making sure your data mining model gave you good predictions; the aim of this blog entry is to describe how you’d load the forecast data into AS, not make sure the forecasts were accurate! Trying to forecast 1.5 years into the future as I did is probably a bit optimistic; it might be better to limit your predictions to a few days or a month. I used AS2005 here but AS2008 has got a much improved Time Series algorithm you can find out about here. And if you want to learn more about AS data mining your starting point should be www.sqlserverdatamining.com; I see also that there’s a 2008 version of the DM team’s book out soon too – I liked the 2005 version and it was certainly a lot more helpful than BOL for understanding all this.
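To illustrate the point above about using MDX to integrate the two measure groups: a calculated measure can fall back to the forecast wherever no actual data exists. A sketch, with the cube and measure names assumed:

```sql
-- Show Actual where it exists, otherwise fall back to Forecast.
-- Cube and measure names are illustrative.
WITH MEMBER [Measures].[Postings Combined] AS
    IIF(
        ISEMPTY([Measures].[Posting Count]),
        [Measures].[Forecast Posting Count],
        [Measures].[Posting Count]
    )
SELECT
    [Measures].[Postings Combined] ON COLUMNS,
    [Date].[Calendar].[Month].MEMBERS ON ROWS
FROM [Postings]
```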
So overall a fun exercise and one I might work up into a presentation for a user group meeting or something similar. I’d be very interested to hear from anyone who is doing this for real though, to find out how well it works in practice; if I get the chance to implement it with one of my customers I’ll blog about it.
And wouldn’t it be cool if, in the next version of Analysis Services, you could automatically add a Forecast partition to your cube in BIDS and have all this work done for you through a wizard?