Archive for July 2014
Recently I had to solve the same, rather complex, problem for three different customers within the space of two weeks, a variation of the ‘events-in-progress’ problem I’ve blogged about a lot in the past. It’s this one: how can you calculate the value of your overdue invoices on any given date? It’s important to stress that we are not talking about the invoices that are overdue today – we want to be able to pick any date in the past and find out the value of invoices that were overdue at that point in time.
Let’s imagine you have a fact table containing payments against invoices: each row has an invoice number, the date of the payment, the payment amount, and the date that the invoice is due. A positive value in the Amount column indicates that this is the opening amount of the invoice; negative values in the Amount column are payments against the invoice. Multiple payments can be made against an invoice before it is fully closed. Here’s some example data:
Looking at the rows highlighted for invoice 5, you can see in the upper box that there is a positive amount of £35 shown on January 1st 2014 – this is the date that the invoice was opened, and £35 is the full value of the invoice. The invoice is due on January 10th 2014. In the lower box you can see there were four separate payments of £5, £5, £10 and £15 before the invoice was fully paid off on January 12th 2014.
Now, if you were to do a total-to-date in MDX (similar to what I describe here) it would be pretty easy to calculate the outstanding amount on all invoices on any given date. However the requirement here is not only to do that, but to break the value down so that you can see what the total value for overdue invoices and non-overdue (which I’m going to call backlog in this post) invoices is on any given date. This is tricky to do because we have to take the Due Date of each invoice into account as well as the Payment Date. The desired output for invoice 5 is this:
As you can see in this PivotTable, for invoice 5 £25 was still outstanding on January 9th 2014, but this is shown as backlog because this is before the due date of January 10th. On January 11th, one day after the due date, the remaining outstanding amount of £15 is shown as overdue. Of course, we also want to be able to calculate the correct values for all invoices:
One way of solving this problem would be to calculate the overdue and backlog values for each date that each invoice is open in your ETL, and store these values in a snapshot fact table. This works, and will give you the best possible query performance, but it has two major drawbacks: first, it makes your ETL much slower, and second it makes your fact table much larger. This post describes how you can calculate the overdue and non-overdue values on any given date using many-to-many relationships instead, without having to blow out the fact table.
To go along with the fact table (called FactInvoice) shown above, I have a date dimension table called DimDate (I have deliberately reduced the number of rows here to the dates I have data for, for reasons that will become clear soon):
I have a dimension table containing all of the days that an invoice can be overdue or not overdue for (which has one less than double the number of rows as the date dimension table), plus a second column classifying each row as ‘Backlog’ or ‘Overdue’:
… and an invoice dimension table that just contains the distinct invoice numbers called DimInvoice.
Now, let me explain how to build the SSAS cube.
- Build dimensions from all of the tables shown above, so you have dimensions called Date, Overdue Days and Invoice.
- Build a cube with one measure group, based on FactInvoice, and create one measure with AggregateFunction Sum based on the Amount column in that table.
- Add the Invoice dimension to the cube with a regular relationship. Add the Date dimension to the cube twice, as Payment Date and Due Date, with regular relationships on the PaymentDate and DueDate columns. The Dimension Usage tab should look like this:
- Create the following view in SQL Server:
- Add this view to the DSV and create a new measure group from it; you’ll need to create a measure here, but it can just be a Count measure.
- Add the Date dimension to the cube once again, this time leaving the name as Date (you will now have three role-playing copies of the Date dimension in the cube).
- Set up regular relationships between the Date and Payment Date dimensions and the new measure group, then a many-to-many relationship between Date and the Invoice measure group. This means that when you query the cube by the Date dimension, the many-to-many relationship will mean you see the sum of all Amounts whose payment date is up to and including the date selected. The Dimension Usage tab will look like this:
CREATE VIEW [dbo].[FactDateToPaymentDate]
SELECT a.DateKey, b.DateKey AS PaymentDateKey
FROM dbo.DimDate AS a INNER JOIN
dbo.DimDate AS b ON a.DateKey >= b.DateKey
This view returns all of the combinations of a given date and all dates up to and including the date.
- Create another view in SQL Server with this definition:
- Add this view to the DSV and build another measure group from it, again with a single Count measure on it.
- Set up a regular relationship between this new measure group and the Due Date dimension.
- Add the Overdue Days dimension to the cube, give it a regular relationship with the new measure group and a many-to-many relationship with the Invoice measure group through the new measure group.
- Add the Date dimension to the cube yet again, creating a new role-playing dimension called As Of Date, give it a regular relationship with the new measure group and a many-to-many relationship with the Invoice measure group through the new measure group. The Dimension Usage tab will now look like this:
CREATE VIEW [dbo].[FactDueDateToAsOfDate]
SELECT a.DateKey AS DueDateKey, b.DateKey AS AsOfDateKey, DATEDIFF(dd, a.FullDate, b.FullDate) AS OverDueDays
FROM dbo.DimDate AS a CROSS JOIN
dbo.DimDate AS b
Yes, I am cross joining the DimDate table with itself and yes, this could return a lot of rows. However you should find that the view is very fast to execute.
The purpose of this measure group is this: if you select a date on the As Of Date dimension, you will be able to select ‘Overdue’ on the Overdue Days dimension and this will give you all of the dates on Due Date that were overdue on that date.
- You only want to have to select one date in your PivotTable, so create the following MDX calculated measures that take your selection on the Date dimension and applies it to the As Of Date dimension too:
CREATE MEMBER CURRENTCUBE.MEASURES.[Open Amount Hidden] AS
LINKMEMBER([Date].[Date].CURRENTMEMBER, [As Of Date].[Date])
CREATE MEMBER CURRENTCUBE.MEASURES.[Open Amount] AS
IIF(MEASURES.[Open Amount Hidden]=0, NULL, MEASURES.[Open Amount Hidden]);
- Hide all measures apart from the calculated measures you’ve just created
- Hide the Payment Date, Due Date and As Of Date dimensions
This is a very complex pattern, I know, and this is after I’ve simplified it a lot (if you need currency conversion as well then things get even worse) but I also know it’s extremely useful from a business point of view. Query performance is also reasonably good, at least in the places where I have implemented this.
You can download my sample SQL Server database and VS 2012 project here.
One of the more recent additions to Power Query is the ability to access data from named ranges in the Excel worksheet rather than an Excel table. I’ve got used to formatting data as tables in Excel because that’s what Power Pivot needs to import data directly from the worksheet, but if you are working with Power Query and a pre-existing workbook then trying to reformat data as tables can be a pain. Also, if you just want to import a single value, for example as a parameter to a query, a table seems like overkill.
(Incidentally, if you’re wondering what a named range is in Excel, there are tons of good introductions to the subject on the internet like this one. You can do loads of cool stuff with them.)
Consider the following Excel worksheet:
There are three named ranges here: FirstRange, SecondRange, ThirdRange, and the values in the cells show which range the cells are in. FirstRange consists of two cells in two columns; SecondRange consists of three cells in a single row; and ThirdRange is consists of three, non-contiguous cells. (You can also use this trick to display the names of all contiguous ranges in an Excel workbook, but alas it does seem to work for non-contiguous ranges).
At the moment, the Power Query ribbon doesn’t make it obvious that you can use named ranges as data sources. However you can see all the tables and cells in a worksheet, and even return that list from a query, by creating a blank query and using the expression
Once you’ve done this you can see all the ranges (and also any tables) in the workbook, and click on the table link next to the name to see the data. For example, clicking on FirstRange shows the following table in a new step in the query editor:
The expression to get at this table in a single step is:
The range SecondRange in my example is equally straightforward to reference, and you can see its contents by using the expression
Unfortunately ThirdRange, which is not contiguous, is a problem: I can only get the first cell in the range. So the expression
Returns just this table:
It would be nice if we could get a list containing the cell values, rather than a table, for ranges like this…
Last thing to mention is that if you do want the value in a cell, rather than a table, you just need to right-click inside the cell in the Query Editor and select Drill Down:
This returns the value (in this case the text “Third Range Cell 1”) in the cell you clicked on:
This is a much more useful value to return than a table containing a single row/column, if you intend to use a value from a single cell in a named range as a parameter to another query.
You can download the sample workbook for this post here.
Seems like another new bit of Power BI functionality got released today: the ability to optimize your data model for Q&A in the browser. Here’s the link to the docs:
Previously, the ability to add synonyms to your model to improve the results you got from Q&A was only available in Excel on the desktop, inside the Power Pivot window. Now you can do this, as well as new stuff like add phrasings (described here) and view usage reports, in your Power BI site.
I won’t repeat what the docs say about the actual functionality, but this seems to be yet more evidence that Excel on the desktop is no longer the central hub for Power BI. If this is the case, this is a massive strategic change, and I can understand why it has happened: the need for the ‘right’ version of Excel on the desktop is a massive roadblock for Power BI adoption, especially in enterprise accounts (see also Jen Underwood’s comments on this from yesterday). Maybe now it’s BI in the browser instead?
OK, so I’m not at WPC this year but I have just watched this video of Scott Guthrie’s session “The Cloud for Modern Business”. If you’re interested in seeing some new Power BI features take a look at the demo by James Phillips, general manager for Power BI, starting at 21:20:
Some of the new things I noticed:
- 21:40 – a nice shot of one of the new Power BI dashboards first announced at the PASS BA Conference earlier this year. You can see several new types of visualisation such as treemaps, radar charts and gauges (gauges? GAUGES? Shhh, don’t tell Stephen Few).
- 22:33 – a list of out-of-the box data sources is shown from which new models can be created. They include: Salesforce, MS Dynamics, Facebook, Google Analytics, Twitter, and Upload Excel.
- 22:50 – data is imported from Salesforce in the browser. This isn’t happening in Excel on the desktop, folks, it’s in the browser. This is significant!
- 23:10 – another new visualisation shown, a doughnut chart (if that’s the right term). I see names of people from the Power Query team in the data.
- 24:50 – a Q&A analysis is pinned to the dashboard
- 25:50 – much is made of the fact that the dashboard is touch-enabled
- 25:55 – “Partner Solution Packs” are announced. This sounds important! It seems to be referring to the Salesforce demo earlier, and these solution packs are said to include: data, connectivity to the data sources, visualisation and interactive reports. So it sounds like Microsoft are going to encourage data vendors (or other sources of data) to build these solution packs on top of Power BI as pre-packaged analytical apps. Probably a good idea.
- 26:15 – editing a dashboard in the browser and swapping one visualisation for another. Again, the HTML 5 browser based editing experience – we haven’t seen Excel once in this demo.
- 27:55 – “If there was ever a partner opportunity, this is it”. Again much emphasis here. Seems like these new Power BI features, especially the solution packs, are aimed at giving partners incentives to sell and customise Power BI (something which they have not had up to now, to be honest).
Oh, and you probably already heard that Azure Machine Learning is now in public preview. Check out the docs and samples here. I wouldn’t be surprised if there was some integration between this and Power BI to come too.
Looking for some summer holiday (or winter holiday, depending on which hemisphere you live in) reading? If so, may I suggest my new Power Query book? “Power Query for Power BI and Excel” is available now from the Apress site, Amazon.com, Amazon.co.uk and all good bookstores.
It’s an introductory level book. It covers all of the stuff you can do in the UI, it has a chapter on M, and it goes into a reasonable amount of detail on more advanced topics; it is not a 500-page exhaustive guide to the product. I’ve focused on readability and teaching the fundamentals of Power Query rather than every looking at every obscure M function, but at the same time if you’ve already used Power Query I think there’ll be plenty of material in there you’ll find interesting.
Now for the bad news: the book is out-of-date already, although not by much. One of the best things about Power Query is the monthly release cycle; unfortunately that makes writing a book on it a bit of a nightmare. I started off writing in January and had to deal with lots of added functionality and changes to the UI over the next few months; I had to retake pretty much all of the screenshots as a result. The published version of the book is based on the version of Power Query that was released in early June rather than the current version. Hopefully you can forgive this – the differences are minor – but it’s a good reason to buy the book as soon as you can! I want to do a second edition in a year’s time once (if?) the release cycle slows down.
I’ve been teased a bit for blogging and teaching so much about Power Query recently, so the final thing I want to say here is why an old corporate BI/SSAS guy like me is getting so excited about a self-service ETL tool. Well, the main reason is that Power Query is a great piece of software. It does what it does very well; it does useful things rather than what the marketing guys/analysts/journalists think is hot in BI; it is easy to use but at the same time is flexible enough for the advanced user to do really complex stuff; it is updated regularly based on feedback from its users. I only wish all Microsoft software was this good… Honestly, I wouldn’t be able to motivate myself to blog and write about Power Query if I didn’t think it was cool, and even though it hasn’t been hyped in the same way as other parts of the Power BI stack it is nonetheless the part that people get excited about when I show them Power BI. It’s not just me either – every day I see positive comments like Greg Low’s here. I think it is as important, if not more important, than Power Pivot and I think it will be a massive success.
Oh, and did I mention that I’m also teaching a Power Query course in London later this year….?
The Unorder() function is probably one of the least used functions in the whole of MDX. It exists only as a query performance hint and, since I had never up to now found a scenario where it did improve the performance of a calculation I had pretty much forgotten about it (as Books Online says, the optimisation it performs is applied automatically in many cases). However I was playing around with some calculations last week and found out that it does have its uses…
What does the Unorder() function do? It’s a function that takes a set and returns a set, and what it does is remove any implicit ordering from that set. By default all sets in MDX are ordered, but for some types of operation that ordering is unimportant and ignoring it can result in faster query performance.
Take, for example, the following query on the Adventure Works cube which shows the number of customers who have bought something up to the current date:
On my laptop it executes in 35 seconds on a cold cache. We can optimise the calculation here simply by wrapping the set of all members on the Customer level of the Customer hierarchy with the Unorder() function, so:
The following query now executes in 27 seconds on a cold cache:
As far as I can tell, Unorder() only makes a difference on calculations when used in combination with NonEmpty(), and when it is used over a large set (here the set of customers has around 18000 members). If you have calculations like this I would recommend testing to see if Unorder() makes a difference – if it does, please leave a comment and let me know what you find!