Archive for the ‘MDX’ Category
Market basket analysis is a fairly well understood problem for SSAS – it’s best handled using many-to-many relationships, as shown in the excellent white paper The Many-to-Many Revolution. However, the other week I was asked a question that was an interesting variation on this problem, and which shows up one of the limitations of the many-to-many approach.
Let me show you an example. The following PivotTable on the Adventure Works cube shows the number of distinct customers who bought products in the product category Bikes:
The Customer Count measure is a distinct count measure, the Country hierarchy from the Customer dimension is shown on rows, and the slicer here shows the Category hierarchy from the Product dimension. If you select Bikes and Clothing in the slicer, like so:
… the measure values are larger because what you’re seeing now is the number of distinct customers who bought products in either the category Bikes or the category Clothing.
But what if you want to see the number of customers who bought Bikes AND Clothing? Well, setting up a many-to-many relationship will give you this (and a lot more useful stuff too) but there are some limitations with this approach:
- Adding the m2m relationship and the extra measure groups it needs increases processing time
- You have to select your two product categories from two different (but identical) hierarchies
- You can only select two product categories – if you want to add a third to your AND filter you need to add a third m2m relationship, and so on. It’s not very flexible.
It is possible to get around these issues with a pure MDX approach, and apply an AND filter that works with however many product categories the user selects.
The key to understanding how this works is to see how you would write an MDX calculated member that does an AND filter on two product categories:
Here I’ve used two nested NonEmpty() functions to filter the set of all customers, first to get the ones that bought Clothing, then to filter these customers again to get the ones that bought Bikes; there’s also an EXISTING there to get only the ones in the current country.
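The calculated member itself is not reproduced here, so the following is only a sketch of what a two-category AND filter could look like on Adventure Works. It assumes that [Internet Sales Amount] is the measure used to decide whether a customer bought something, and it refers to the two categories by name; neither choice is confirmed by the text above.

```mdx
WITH
MEMBER MEASURES.[Bought Bikes And Clothing] AS
  COUNT(
    -- EXISTING restricts the result to customers in the current country
    EXISTING
    NONEMPTY(
      -- inner filter: customers with sales for Clothing
      NONEMPTY(
        [Customer].[Customer].[Customer].MEMBERS,
        {([Measures].[Internet Sales Amount], [Product].[Category].[Clothing])}
      ),
      -- outer filter: of those, customers who also have sales for Bikes
      {([Measures].[Internet Sales Amount], [Product].[Category].[Bikes])}
    )
  )
SELECT
  {MEASURES.[Bought Bikes And Clothing]} ON COLUMNS,
  [Customer].[Country].[Country].MEMBERS ON ROWS
FROM [Adventure Works]
```

Each extra category in the AND filter means wrapping the expression in one more NonEmpty() call.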
Once again, to add more product categories to the AND filter you need to add more nested NonEmpty() functions… which means you need a way to dynamically generate the code, which of course you can do using StrToSet(). Normally I avoid using StrToSet() inside MDX calculations because it can cause serious performance problems, but in this case it’s the best choice.
The following named set and calculated measure show how to solve the problem for Adventure Works:
The dynamic named set is there to help work out what is selected in the slicer in my Excel worksheet, as shown in Mosha’s old post here. The calculated measure is where all the interesting stuff happens: it uses the ability of the Generate() function to iterate over a set (in this case the dynamic named set), evaluate a string expression and concatenate these strings. The output of this is a series of nested NonEmpty()s, which then goes to StrToSet() to be evaluated, and then the contents of that set are counted.
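As a hedged sketch of the pattern just described (not necessarily the original definitions), the dynamic set and calculated measure might look like this. The object names [Selected Product Categories] and [Customers Buying All Selected Categories] are made up for illustration, and [Internet Sales Amount] is again an assumed choice of measure.

```mdx
-- Captures whichever categories are currently selected, for example in an Excel slicer
CREATE DYNAMIC SET CURRENTCUBE.[Selected Product Categories] AS
  EXISTING [Product].[Category].[Category].MEMBERS;

CREATE MEMBER CURRENTCUBE.MEASURES.[Customers Buying All Selected Categories] AS
  COUNT(
    EXISTING
    STRTOSET(
      -- one opening "NONEMPTY(" per selected category
      GENERATE([Selected Product Categories], "NONEMPTY(")
      + "[Customer].[Customer].[Customer].MEMBERS"
      -- one filter argument and closing bracket per selected category
      + GENERATE(
          [Selected Product Categories],
          ", {([Measures].[Internet Sales Amount], "
            + [Product].[Category].CURRENTMEMBER.UNIQUENAME
            + ")})"
        )
    )
  );
```

With two categories selected, the string passed to StrToSet() has the shape NONEMPTY(NONEMPTY(customers, {(measure, first category)}), {(measure, second category)}), which is the same nesting you would otherwise write by hand.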
This technique does not replace an m2m relationship, because it will not let you build a query with product category on both rows and columns that shows the number of customers who bought each combination. However, if all you want to do is show the number of customers who have bought one specific combination, it has a lot of advantages.
For all of you MDX fans out there, here’s another approach I came up with which doesn’t perform quite as well but is so much fun I had to share it:
I’m using Generate() here again, but this time I’m using it with inline named sets (which are even worse for performance than StrToSet) in a way similar to this classic post. The theory is the same though: it’s a way of calling NonEmpty() multiple times over a set with an arbitrary number of items in it.
Here’s what the calculation does:
- Inside the first Intersect() I’m declaring an inline named set called MYCUSTOMERS, then doing the Intersect() with an empty set to return… an empty set. But I’ve managed to declare my inline named set, which is the point here.
- I’m then iterating over the same dynamic named set shown in my earlier example using Generate(), and:
- For all but the last product category in that set, I’m doing the NonEmpty() on the contents of MYCUSTOMERS on the current product category and then overwriting the contents of MYCUSTOMERS with the output. However, for these iterations of Generate() I’m returning an empty set.
- For the last product category in the set I’m actually returning the output of NonEmpty() over MYCUSTOMERS for the current product category. The reason I’m only returning something other than an empty set on the last iteration of Generate() is that Generate() returns the union of all the sets returned by each iteration, and that would give me the wrong results! I only want to output the set from the final iteration.
Recently I had to solve the same, rather complex, problem for three different customers within the space of two weeks, a variation of the ‘events-in-progress’ problem I’ve blogged about a lot in the past. It’s this one: how can you calculate the value of your overdue invoices on any given date? It’s important to stress that we are not talking about the invoices that are overdue today – we want to be able to pick any date in the past and find out the value of invoices that were overdue at that point in time.
Let’s imagine you have a fact table containing payments against invoices: each row has an invoice number, the date of the payment, the payment amount, and the date that the invoice is due. A positive value in the Amount column indicates that this is the opening amount of the invoice; negative values in the Amount column are payments against the invoice. Multiple payments can be made against an invoice before it is fully closed. Here’s some example data:
Looking at the rows highlighted for invoice 5, you can see in the upper box that there is a positive amount of £35 shown on January 1st 2014 – this is the date that the invoice was opened, and £35 is the full value of the invoice. The invoice is due on January 10th 2014. In the lower box you can see there were four separate payments of £5, £5, £10 and £15 before the invoice was fully paid off on January 12th 2014.
Now, if you were to do a total-to-date in MDX (similar to what I describe here) it would be pretty easy to calculate the outstanding amount on all invoices on any given date. However the requirement here is not only to do that, but to break the value down so that you can see what the total value for overdue invoices and non-overdue (which I’m going to call backlog in this post) invoices is on any given date. This is tricky to do because we have to take the Due Date of each invoice into account as well as the Payment Date. The desired output for invoice 5 is this:
As you can see in this PivotTable, for invoice 5 £25 was still outstanding on January 9th 2014, but this is shown as backlog because this is before the due date of January 10th. On January 11th, one day after the due date, the remaining outstanding amount of £15 is shown as overdue. Of course, we also want to be able to calculate the correct values for all invoices:
One way of solving this problem would be to calculate the overdue and backlog values for each date that each invoice is open in your ETL, and store these values in a snapshot fact table. This works, and will give you the best possible query performance, but it has two major drawbacks: first, it makes your ETL much slower, and second it makes your fact table much larger. This post describes how you can calculate the overdue and non-overdue values on any given date using many-to-many relationships instead, without having to blow out the fact table.
To go along with the fact table (called FactInvoice) shown above, I have a date dimension table called DimDate (I have deliberately reduced the number of rows here to the dates I have data for, for reasons that will become clear soon):
I have a dimension table containing all of the days that an invoice can be overdue or not overdue for (which has one fewer than double the number of rows in the date dimension table), plus a second column classifying each row as ‘Backlog’ or ‘Overdue’:
… and an invoice dimension table that just contains the distinct invoice numbers called DimInvoice.
Now, let me explain how to build the SSAS cube.
- Build dimensions from all of the tables shown above, so you have dimensions called Date, Overdue Days and Invoice.
- Build a cube with one measure group, based on FactInvoice, and create one measure with AggregateFunction Sum based on the Amount column in that table.
- Add the Invoice dimension to the cube with a regular relationship. Add the Date dimension to the cube twice, as Payment Date and Due Date, with regular relationships on the PaymentDate and DueDate columns. The Dimension Usage tab should look like this:
- Create the following view in SQL Server:
CREATE VIEW [dbo].[FactDateToPaymentDate]
AS
SELECT a.DateKey, b.DateKey AS PaymentDateKey
FROM dbo.DimDate AS a INNER JOIN
dbo.DimDate AS b ON a.DateKey >= b.DateKey
This view returns all of the combinations of a given date and all dates up to and including that date.
- Add this view to the DSV and create a new measure group from it; you’ll need to create a measure here, but it can just be a Count measure.
- Add the Date dimension to the cube once again, this time leaving the name as Date (you will now have three role-playing copies of the Date dimension in the cube).
- Set up regular relationships between the Date and Payment Date dimensions and the new measure group, then a many-to-many relationship between Date and the Invoice measure group. This means that when you query the cube by the Date dimension, you see the sum of all Amounts whose payment date is up to and including the date selected. The Dimension Usage tab will look like this:
- Create another view in SQL Server with this definition:
CREATE VIEW [dbo].[FactDueDateToAsOfDate]
AS
SELECT a.DateKey AS DueDateKey, b.DateKey AS AsOfDateKey, DATEDIFF(dd, a.FullDate, b.FullDate) AS OverDueDays
FROM dbo.DimDate AS a CROSS JOIN
dbo.DimDate AS b
Yes, I am cross joining the DimDate table with itself and yes, this could return a lot of rows. However, you should find that the view is very fast to execute.
- Add this view to the DSV and build another measure group from it, again with a single Count measure on it.
- Set up a regular relationship between this new measure group and the Due Date dimension.
- Add the Overdue Days dimension to the cube, give it a regular relationship with the new measure group and a many-to-many relationship with the Invoice measure group through the new measure group.
- Add the Date dimension to the cube yet again, creating a new role-playing dimension called As Of Date, give it a regular relationship with the new measure group and a many-to-many relationship with the Invoice measure group through the new measure group. The Dimension Usage tab will now look like this:
The purpose of this measure group is this: if you select a date on the As Of Date dimension, you will be able to select ‘Overdue’ on the Overdue Days dimension and this will give you all of the dates on Due Date that were overdue on that date.
- You only want to have to select one date in your PivotTable, so create the following MDX calculated measures, which take your selection on the Date dimension and apply it to the As Of Date dimension too:
CREATE MEMBER CURRENTCUBE.MEASURES.[Open Amount Hidden] AS
-- assumes the Sum measure built on the Amount column earlier is named [Amount]
([Measures].[Amount],
LINKMEMBER([Date].[Date].CURRENTMEMBER, [As Of Date].[Date]));
CREATE MEMBER CURRENTCUBE.MEASURES.[Open Amount] AS
IIF(MEASURES.[Open Amount Hidden]=0, NULL, MEASURES.[Open Amount Hidden]);
- Hide all measures apart from the calculated measures you’ve just created
- Hide the Payment Date, Due Date and As Of Date dimensions
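Once the cube is built, a query along the following lines could be used to check the invoice 5 breakdown described earlier. The cube name [Invoices] and the attribute names [Overdue Type] and [Invoice Number] are placeholders, since the steps above do not name them; substitute whatever your own cube uses.

```mdx
SELECT
  -- Backlog and Overdue on columns (hypothetical attribute name)
  [Overdue Days].[Overdue Type].[Overdue Type].MEMBERS ON COLUMNS,
  -- one row per date on which something is still outstanding
  NON EMPTY [Date].[Date].[Date].MEMBERS ON ROWS
FROM [Invoices]  -- hypothetical cube name
WHERE (
  MEASURES.[Open Amount],
  [Invoice].[Invoice Number].&[5]  -- hypothetical attribute name and key
)
```

Only the Date dimension is queried directly; the LinkMember() call inside the calculated measure applies the same date to As Of Date behind the scenes.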
This is a very complex pattern, I know, and this is after I’ve simplified it a lot (if you need currency conversion as well then things get even worse) but I also know it’s extremely useful from a business point of view. Query performance is also reasonably good, at least in the places where I have implemented this.
You can download my sample SQL Server database and VS 2012 project here.
The Unorder() function is probably one of the least used functions in the whole of MDX. It exists only as a query performance hint and, since up to now I had never found a scenario where it improved the performance of a calculation, I had pretty much forgotten about it (as Books Online says, the optimisation it performs is applied automatically in many cases). However, I was playing around with some calculations last week and found out that it does have its uses…
What does the Unorder() function do? It’s a function that takes a set and returns a set, and what it does is remove any implicit ordering from that set. By default all sets in MDX are ordered, but for some types of operation that ordering is unimportant and ignoring it can result in faster query performance.
Take, for example, the following query on the Adventure Works cube which shows the number of customers who have bought something up to the current date:
On my laptop it executes in 35 seconds on a cold cache. We can optimise the calculation here simply by wrapping the set of all members on the Customer level of the Customer hierarchy with the Unorder() function, so:
The following query now executes in 27 seconds on a cold cache:
As far as I can tell, Unorder() only makes a difference on calculations when used in combination with NonEmpty(), and when it is used over a large set (here the set of customers has around 18000 members). If you have calculations like this I would recommend testing to see if Unorder() makes a difference – if it does, please leave a comment and let me know what you find!
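To make the pattern concrete, here is a sketch of a customers-to-date calculation with Unorder() applied. It is not the exact query timed above: the measure [Internet Sales Amount], the [Date].[Calendar] hierarchy and the name [Customers To Date] are assumptions chosen to fit Adventure Works.

```mdx
WITH
MEMBER MEASURES.[Customers To Date] AS
  COUNT(
    NONEMPTY(
      -- UNORDER() signals that the order of the customer set does not matter here
      UNORDER([Customer].[Customer].[Customer].MEMBERS),
      -- every date from the start of the hierarchy level up to the current date
      {[Measures].[Internet Sales Amount]}
        * {NULL : [Date].[Calendar].CURRENTMEMBER}
    )
  )
SELECT
  {MEASURES.[Customers To Date]} ON COLUMNS,
  [Date].[Calendar].[Date].MEMBERS ON ROWS
FROM [Adventure Works]
```

Removing the UNORDER() wrapper gives the unoptimised form, so the two versions can be compared on a cold cache.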
I had an interesting challenge from a customer yesterday – one of those problems that I’d known about for a long time but never got round to working out the solution for…
Consider the following PivotTable, based on a PowerPivot model using Adventure Works data, in Excel 2010:
It shows the top 10 products by the measure Sum of Sales. There are two slicers, and the top 10 shown in the PivotTable reflects the selections made in the slicers. All of this works fine. But what if you want to use Excel cube functions to do the same thing? You can write the MDX for the top 10 products quite easily and use it in the CubeSet() function in your worksheet, but how can you get your MDX set expression to respect the selection made in the slicers?
The solution to this problem is very similar to the trick I showed here – finding the selected items in a slicer is not easy! Here are the steps I followed to do it:
- Add the slicers for EnglishOccupation and CalendarYear to a new worksheet
- Go to Slicer Settings and uncheck the box for “Visually indicate items with no data”
- Add two new PivotTables to the worksheet. Connect one to the EnglishOccupation slicer and put EnglishOccupation on rows; connect the other to the CalendarYear slicer and put CalendarYear on rows.
- Use the OLAPPivotTableExtensions add-in (which you can download here) to add new MDX calculated measures to each PivotTable. For the EnglishOccupation PivotTable call the measure SelectedOccupations and use the following MDX:
This expression does the following: it uses the Axis() function to find the set of members selected on what Excel thinks of as the rows axis in the PivotTable (actually the MDX columns axis), then uses Except() to remove the All Member from the hierarchy (which Excel uses for the Grand Totals) and then uses SetToStr() to take that set and return the string representation of it. Do the same thing for the PivotTable showing CalendarYear too, calling the calculated measure SelectedYears; the MDX in this case is:
This is what the EnglishOccupation PivotTable should look like:
- Next, to make things easy, use Excel formulas to get the values from the top cell inside each PivotTable into cells elsewhere in the worksheet, and give these cells the names SelectedOccupations and SelectedYears.
- Then enter a CubeSet() function into a new cell using the following formula:
Sum(" & SelectedOccupations & " * " & SelectedYears & ",[Measures].[Sum of Sales])
"Top 10 Set")
What this does is use the TopCount() function to find the top 10 Products, and in the third parameter of this function which is the numeric expression to find the top 10 by, it crossjoins the two sets of selected occupations and selected years and then sums the output of the crossjoin by the measure [Sum of Sales].
- Last of all, build your report using the Excel cube functions as normal, using the CubeRankedMember() function to get each item from the top 10 set created in the previous step.
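The two kinds of MDX expression used in these steps could look roughly like the fragments below. They are illustrative only: the table names DimCustomer, DimDate and DimProduct, the column EnglishProductName and the member keys are assumptions about the PowerPivot model, not details taken from the workbook.

```mdx
-- Fragment 1: expression for the SelectedOccupations calculated measure.
-- Takes the members on the first axis of the PivotTable's query, removes
-- the All member and returns what is left as a string.
SetToStr(
  Except(
    Axis(0),
    {[DimCustomer].[EnglishOccupation].[All]}
  )
)

-- Fragment 2: the set expression CubeSet() ends up receiving once the two
-- strings have been concatenated in, shown for a hypothetical selection of
-- two occupations and one year.
TopCount(
  [DimProduct].[EnglishProductName].[EnglishProductName].MEMBERS,
  10,
  Sum(
    {[DimCustomer].[EnglishOccupation].&[Clerical],
     [DimCustomer].[EnglishOccupation].&[Manual]}
      * {[DimDate].[CalendarYear].&[2003]},
    [Measures].[Sum of Sales]
  )
)
```

Because SetToStr() returns the set wrapped in braces, its output can be dropped straight into the Sum() call without any further string handling in Excel.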
You can download my sample workbook here.
The bad news about this technique is that it doesn’t work in Excel 2013 and Power Pivot. It’s no longer possible to create MDX calculated measures on Power Pivot models in Excel 2013, alas. It will work if you’re using any version of Excel from 2007 on against Analysis Services and, as I show here, Excel 2010 and PowerPivot. If you are using Power Pivot and Excel 2013 it might be possible to create a DAX measure to do the same as the MDX I’ve used here (I’m wondering if the technique Jason describes here will work). It would certainly be possible to use CubeRankedMember() to find each item selected in the slicer, as Erik Svensen shows here, and then use Excel formulas to find the MDX unique name for each selected member and concatenate these unique names to create the set expression that my calculated measures return, but that’s a topic for another post. This really should be a lot easier than it is…
One subject I have blogged about many times here is how the use of certain MDX functions and features prevents the Formula Engine from caching the result of a calculation for longer than the lifetime of a query (see here and here for just two examples). Reading the new SSAS 2012/2014 Performance Guide, I spotted the following:
The use of MDX functions that are locale-dependent (such as CAPTION or .Properties) prevents the use of the global cache, because different sessions may be connected with different locales and cached results for one locale may not be correct for another locale.
It’s a bit vague but here’s an example of it happening. Take the following calculated measure for the Adventure Works cube:
If you run the following query, first on a cold cache and then on a warm cache:
In Profiler you will see that the second time it is run, the values for the calculated measure are returned from the Formula Engine cache:
Here, the Get Data From Cache event is showing that on the second execution of the query the Formula Engine cache is being used.
However, if you change the calculation so it uses the MEMBER_CAPTION property instead of the MEMBER_NAME property, like so:
Then run the query again, on a cold cache then a warm cache, you will see the following:
Here the Storage Engine cache is being used – and since the query only contains a single calculated measure, this means that the Formula Engine cache is not being used, which may result in noticeably worse overall query performance if your calculation is expensive.
The reason there is a difference between MEMBER_NAME and MEMBER_CAPTION is that the former property will return the same value for all users, but the latter may return different values for users in different locales. Therefore it is not safe to cache the results of calculations that use the MEMBER_CAPTION property because these calculations could return different values for different users for the same query.
One area where I see people have a lot of problems when they’re learning MDX is when they are working with dates. SQL has a lot of great functionality for working with dates while MDX, to be honest, does not – therefore people tie themselves in knots writing complex MDX expressions (which often don’t perform all that well) to recreate what they would do in SQL. However, as is often the case with MDX, approaching the questions from a completely different angle yields much better results.
For example, let’s imagine that you wanted to calculate the age in years of any customer on any given date. Using the Adventure Works cube you have a Date dimension, which will be how you want to select the date that the age is to be calculated at; there is also, on the Customer dimension, an attribute called Birth Date that gives you each customer’s date of birth. You need a calculated measure that calculates the age of each customer for any date selected on the Date dimension.
If you think about this as a SQL date problem, you’ll probably want to use functions like DateDiff() – one of the VBA functions that is available in MDX (but is not native MDX). The following article shows how DateDiff() and other VBA functions can be used to calculate ages in VBA code:
Here’s a query showing how this calculation can be implemented in MDX:
The output of the query is this:
On my laptop this query, which returns 365 columns and 106 rows, runs in around 3.5 seconds. A few things to point out:
- I’m using the .Member_Value function instead of the .MemberValue function to get the typed date value associated with each date, as Bill Anton’s recent blog post points out that it is marginally faster.
- I’m using .Properties(“Birth Date”, typed) to get the typed date value for each customer’s birth date. If I set the ValueColumn property on the Customer attribute of the Customer dimension, or enabled the hierarchy on the Birth Date attribute, I would be able to use .Member_Value here too and get a very minor performance improvement.
- Most of the time spent by the query is taken by the yeardiff calculated member, which uses the DateDiff() function. This isn’t surprising because calling any VBA function in MDX is expensive and something to be avoided.
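The points above refer to a DateDiff()-based calculation; a sketch of what such a query could look like follows. It is an approximation for illustration, not the original query: the member names are made up (apart from yeardiff, which the text mentions) and the Head() calls simply keep the result small.

```mdx
WITH
-- typed date of birth, read from the Birth Date member property
MEMBER MEASURES.BIRTHDATE AS
  [Customer].[Customer].CURRENTMEMBER.PROPERTIES("Birth Date", TYPED)
-- typed value of the date selected on the Date dimension
MEMBER MEASURES.ASOFDATE AS
  [Date].[Date].CURRENTMEMBER.MEMBER_VALUE
-- difference between the two calendar years only
MEMBER MEASURES.YEARDIFF AS
  DATEDIFF("yyyy", MEASURES.BIRTHDATE, MEASURES.ASOFDATE)
-- subtract one if the birthday has not yet been reached in the as-of year
MEMBER MEASURES.AGE AS
  IIF(
    MONTH(MEASURES.ASOFDATE) * 100 + DAY(MEASURES.ASOFDATE)
      < MONTH(MEASURES.BIRTHDATE) * 100 + DAY(MEASURES.BIRTHDATE),
    MEASURES.YEARDIFF - 1,
    MEASURES.YEARDIFF
  )
SELECT
  HEAD([Date].[Date].[Date].MEMBERS, 7) ON COLUMNS,
  HEAD([Customer].[Customer].[Customer].MEMBERS, 10) ON ROWS
FROM [Adventure Works]
WHERE (MEASURES.AGE)
```

DateDiff("yyyy", …) only compares the year parts of the two dates, which is why the month-and-day comparison is needed to correct the result.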
Performance here is ok, but we can do better. The most important thing to do when tuning any MDX calculation is to try to pre-calculate as much as possible. In this case there are two calculations we can move out of MDX into the structure of the cube:
- The DateDiff() function here just calculates the difference between the years of the two dates. Therefore if you create a new attribute on the Customer dimension to hold the integer representing the year of the customer’s birth date, and use the existing Calendar Year attribute on the Date dimension, you can avoid using DateDiff() completely and do a simple subtraction.
- Likewise, rather than calculating the combination of month and date in the MDX, these values can be pre-calculated in the dimension and then accessed as properties.
Here’s the rewritten version of the calculation, assuming that the following new attributes have been added to the Customer and Date dimensions:
- [Customer].[Birth Year] holds the year of the customer’s birth as an integer
- [Customer].[Birth Month Day] is an integer containing the month number multiplied by 100 plus the day number of the month for the customer’s birth date. For example the date 5th June would have the value 605.
- [Date].[Month Day] is the same as [Customer].[Birth Month Day] but for the Date dimension.
This query runs in around 1.2 seconds, a significant improvement. Of course if you were to create a completely new fact table that contained the age of every customer on every date in the Date dimension then you wouldn’t need to use any calculated measures and performance would be even better, but that may not be practical if you have a lot of customers and a lot of dates.
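A sketch of the rewritten calculation, under the assumptions listed above, might look like this. It further assumes that the three new attributes are exposed as member properties of the Customer and Date key attributes, and that the key of [Date].[Calendar Year] is the integer year; check both against your own dimensions.

```mdx
WITH
-- plain integer subtraction replaces DATEDIFF()
MEMBER MEASURES.YEARDIFF AS
  [Date].[Calendar Year].CURRENTMEMBER.MEMBER_KEY
    - [Customer].[Customer].CURRENTMEMBER.PROPERTIES("Birth Year", TYPED)
-- compare the pre-calculated month * 100 + day integers
MEMBER MEASURES.AGE AS
  IIF(
    [Date].[Date].CURRENTMEMBER.PROPERTIES("Month Day", TYPED)
      < [Customer].[Customer].CURRENTMEMBER.PROPERTIES("Birth Month Day", TYPED),
    MEASURES.YEARDIFF - 1,
    MEASURES.YEARDIFF
  )
SELECT
  HEAD([Date].[Date].[Date].MEMBERS, 7) ON COLUMNS,
  HEAD([Customer].[Customer].[Customer].MEMBERS, 10) ON ROWS
FROM [Adventure Works]
WHERE (MEASURES.AGE)
```

For example, a customer with Birth Month Day 605 (5th June) evaluated on a date with Month Day 412 (12th April) has not yet had a birthday that year, so one is subtracted from the year difference.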
I was wondering the other day (as you do) which of the MDX cell properties Excel PivotTables actually support. This page has all the details on the cell properties that are available in an MDX query but most client tools don’t bother retrieving all of them and Excel is no different. Of course it retrieves the most important properties and it retrieves one or two others, but I couldn’t resist doing a bit of research to find out the exact situation with Excel 2013.
The first thing to note is that you can control which cell properties Excel retrieves for a given connection in the connection properties dialog, in the OLAP Server Formatting section:
If you can live with not getting all of the cell properties back from SSAS there are some scenarios where unchecking all of the boxes in the OLAP Server Formatting section can improve performance:
- When you have a large number of databases and cubes on your server, and/or complex security, because a side effect of the way Excel retrieves properties is that it causes all cubes in all databases to be loaded, their MDX Scripts executed and security evaluated. See here for more details.
- When your queries return a lot of data and your network is slow. See page 61 of the SQLCAT Guide to BI and Analytics for more details.
You can also quite easily see which cell properties Excel is retrieving by looking at the MDX queries it generates (thank you OLAP PivotTable Extensions). Here’s an example of a simple PivotTable query run on a connection which has all of the boxes checked in the OLAP Server Formatting section:
In the cell properties clause of this query you can see the six properties returned. Here’s a breakdown of each of them.
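For reference, a simplified query with the same shape is sketched below. The axis expression Excel really generates is more elaborate than this, and the hierarchy and measure here are arbitrary Adventure Works choices; the part that matters is the CELL PROPERTIES clause at the end.

```mdx
SELECT
  NON EMPTY [Customer].[Country].MEMBERS
  DIMENSION PROPERTIES PARENT_UNIQUE_NAME, HIERARCHY_UNIQUE_NAME ON COLUMNS
FROM [Adventure Works]
WHERE ([Measures].[Internet Sales Amount])
-- the six cell properties requested when every formatting box is checked
CELL PROPERTIES VALUE, FORMAT_STRING, LANGUAGE, BACK_COLOR, FORE_COLOR, FONT_FLAGS
```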
VALUE and FORMAT_STRING
Excel doesn’t actually retrieve the FORMATTED_VALUE cell property, which gives you the measure value with formatting applied by SSAS. Instead, assuming you have the Number Format box checked in the Connection Properties dialog, it retrieves the VALUE property from SSAS (which contains the raw, unformatted measure value) and the FORMAT_STRING property (which contains the format string you defined on the server). It then tries to convert the format string into an Excel format for the PivotTable. Unfortunately it can’t always do the conversion successfully – I blogged about a problem with the Percent format some time ago and this is still a problem with Excel 2013. Excel also doesn’t support formats defined in the fourth section (see here for more details) of the FORMAT_STRING property for null values, and there are probably lots of other relatively obscure types of formatting it doesn’t support too.
LANGUAGE
The language property of a cell controls things like the currency symbol displayed when you are using the “Currency” built-in format string and the thousands and decimal separators used. Whether Excel retrieves the language property or not is also linked to the Number Format checkbox in the Connection Properties dialog. I strongly recommend that you do not use the “Currency” format string and the Language property if you are working with multiple currencies, for reasons I outlined here.
BACK_COLOR and FORE_COLOR
I’ve never particularly liked using the BACK_COLOR and FORE_COLOR properties on a cell to do traffic light-style reporting, to be honest, although I know some people love it. However I do use these properties a lot when debugging scoped assignments as seen here. BACK_COLOR is retrieved if you have the Fill Color box checked on the connection properties dialog; FORE_COLOR is retrieved if you have the Text Color box checked on the connection properties dialog.
FONT_FLAGS
Probably the only surprise of this whole exercise was the fact that Excel retrieves the FONT_FLAGS property if you have the Font Style box checked on the connection properties dialog; alas, it doesn’t support FONT_SIZE or FONT_NAME. For example, suppose you put the following calculated members on your cube:
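The original definitions are not reproduced here, so the ones below are a sketch with made-up member names. They rely on the FONT_FLAGS values being a bitmask in which 1 is bold, 2 is italic, 4 is underline and 8 is strikeout.

```mdx
CREATE MEMBER CURRENTCUBE.MEASURES.[Bold Text] AS "Bold", FONT_FLAGS = 1;
CREATE MEMBER CURRENTCUBE.MEASURES.[Italic Text] AS "Italic", FONT_FLAGS = 2;
CREATE MEMBER CURRENTCUBE.MEASURES.[Underlined Text] AS "Underlined", FONT_FLAGS = 4;
CREATE MEMBER CURRENTCUBE.MEASURES.[Strikeout Text] AS "Strikeout", FONT_FLAGS = 8;
-- the flags can be added together: 1 + 2 requests bold and italic
CREATE MEMBER CURRENTCUBE.MEASURES.[Bold Italic Text] AS "Bold and italic", FONT_FLAGS = 3;
```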
If you drop them into a PivotTable, you will see the following returned: