Chris Webb's BI Blog

Analysis Services, MDX, PowerPivot, DAX and anything BI-related

Archive for June 2013

Optimising Returning Customers Calculations in MDX

with 11 comments

One of the more popular blog posts from my archives (86 comments so far) is the one I wrote on “Counting New and Returning Customers in MDX”. The trouble with all of the calculations in there is that they execute in cell-by-cell mode, and therefore perform quite badly.

For example, take the following query on Adventure Works to find the number of returning customers (customers who have bought from us today and have also bought something before in the past):

with

member measures.[Returning Customers V1] as

count(

intersect(

nonempty([Customer].[Customer].[Customer].members, [Measures].[Internet Sales Amount])

,

nonempty([Customer].[Customer].[Customer].members, 

    [Measures].[Internet Sales Amount] * {null : [Date].[Date].currentmember.prevmember})

)

)

 

select {measures.[Returning Customers V1]} on 0,

[Date].[Date].[Date].members

on 1

from 

[Adventure Works]

 

On a cold cache this takes 47 seconds on my laptop and a quick look in Profiler shows this executes in cell-by-cell mode. In the comments on the original post Deepak Puri suggested an alternative approach using the Customer Count distinct count measure:

with

member measures.customerstodate as

aggregate(null:[Date].[Date].currentmember, [Measures].[Customer Count])

 

member measures.customerstoprevdate as

([Date].[Date].currentmember.prevmember, [Measures].customerstodate)

 

member measures.newcustomers as

measures.customerstodate - measures.customerstoprevdate

 

member measures.[Returning Customers V2] as

[Measures].[Customer Count] - measures.newcustomers

 

select {measures.[Returning Customers V2]} on 0,

[Date].[Date].[Date].members

on 1

from 

[Adventure Works]

Interestingly, this performs even worse than the previous query (although I would have expected it to be better). So how can we write a query that returns in a reasonable amount of time?

I haven’t found a way to do this for a calculated measure defined on the server, to be used in a true ad hoc query environment like Excel (any suggestions welcome – please leave a comment if you can do it), but I have got a way of optimising this calculation for scenarios where you have control over the MDX being used, such as in SSRS.

Here’s the query:

with

 

set customerdates as

nonempty(

[Date].[Date].[Date].members

*

[Customer].[Customer].[Customer].members

, [Measures].[Internet Sales Amount])

 

set nondistinctcustomers as

generate(

customerdates,

{[Customer].[Customer].currentmember}, all)

 

member measures.customercountsum as

sum(null:[Date].[Date].currentmember, [Measures].[Customer Count])

 

member measures.[Returning Customers V3] as

count(

intersect(

subset(nondistinctcustomers

    , (measures.customercountsum, [Date].[Date].currentmember.prevmember)

    , [Measures].[Customer Count])

,

head(nondistinctcustomers

    , (measures.customercountsum, [Date].[Date].currentmember.prevmember))

)

)

 

 

select {measures.[Returning Customers V3]} on 0,

[Date].[Date].[Date].members

on 1

from 

[Adventure Works]

 

On my laptop, this query executes in around 5 seconds on a cold cache. The reason it’s so much faster is also the reason it can’t be used in ad hoc queries – it uses named sets to find all the combinations of customer date needed by the query in one operation. Here’s a step-by-step explanation of how it works:

  • First of all, the customerdates set gets a set of tuples containing every single combination of day and customer where a purchase was made, using a simple Nonempty().
  • Next, the nondistinctcustomers set takes the customerdates set and removes the dates, so what we are left with is a list of customers. It’s not a list of distinct customers, however – a given customer may appear more than once. This still represents a list of the customers that bought something each day, it’s just that we no longer have any information about which day we’re looking at.
  • The customercountsum measure allows us to take the list of customers in the nondistinctcustomers set and find out which customers bought something in any given day. It’s a running sum of the Customer Count measure. This is a distinct count measure, and usually you wouldn’t use the Sum() function on a distinct count, but it’s important we do here. How is it used? For example, let’s imagine we had just three days of data: on the first day we had three customers, on the second four customers and on the third five customers. That would result in the nondistinctcustomers set containing twelve (not necessarily distinct) customers. We can then use the running sum of a distinct count of customers to find out the index of the item in nondistinctcustomers that is the last customer in the list for each day. So on day two we would have a running sum of seven, and therefore the seventh item in nondistinctcustomers gives us the last customer in the list for that day.
  • Finally, the Returning Customers V3 measure gives us the number of returning customers each day. It uses the customercountsum measure to find the subsets of the nondistinctcustomers set that represent the customers that bought on the current day and the customers that bought on all days up to yesterday, then uses the Intersect() function to find the returning customers.

Written by Chris Webb

June 28, 2013 at 3:48 pm

Upcoming SSAS and MDX Training Course Dates

with 2 comments

If you’ve got some training budget burning a hole in your pocket, here’s a quick reminder of some upcoming SSAS and MDX courses I’m teaching:

Hope to see you there! If you’d like to find out about new courses by me and other SQL Server people, you can sign up to the Technitrain newsletter at http://www.technitrain.com

PS I’ve also got Andy Leonard coming over to London to run his ever-popular SSIS Design Patterns course too, in early September.

Written by Chris Webb

June 26, 2013 at 9:30 am

Posted in Events, Technitrain

Flattening A Parent/Child Relationship In Data Explorer (Power Query)

with 4 comments

NOTE: This post was written before Data Explorer was renamed as Power Query. All of the content is still relevant to Power Query.

I was teaching my SSAS cube design and performance tuning course this week (which I’ll be teaching in Sydney and Melbourne next month, along with some MDX – places still available!) and demonstrating BIDS Helper’s excellent functionality for flattening parent/child relationships, and it got me thinking – can I do the same thing in Data Explorer? Not that I need to do this in Data Explorer, you know, but it’s the kind of challenge I like to set myself. Of course you can do it, and quite elegantly, and since I learned yet more interesting stuff about Data Explorer and M while I was cracking this problem I thought I’d blog about it.

Here’s what I want to do. Consider the parent/child hierarchy in the DimEmployees table in the Adventure Works DW database:

image

Each row represents an employee, EmployeeKey is the primary key and ParentEmployeeKey is the key of the employee’s boss. Therefore, by joining the table to itself, we can recreate the org chart of the Adventure Works company (ie who reports to who). The problem though is that we need to join the table to itself multiple times to do this, and the number of times we need to do the join depends on the data itself. If you flatten a parent/child hierarchy by doing this, the end result should have a series of columns representing each level in the hierarchy, and look something like this:

image

This problem can be solved in SQL reasonably easily, even if the SQL you end up writing might look a little scary (see the views that BIDS Helper generates for an example of this). What about Data Explorer?

At the heart of my approach was a recursive function. I’ve blogged about creating functions in Data Explorer already, so you might want to read that post for some background. Here’s my function declaration:

let

    Source = (FromTable, KeyColumn, ParentKeyColumn, ToTable, optional Depth) =>

let

    GetDepth = if (Depth=null) then 1 else Depth,

    GetKeyColumn = if (Depth=null) then KeyColumn

        else Number.ToText(GetDepth-1) & "." & KeyColumn,

    GetParentKeyColumn = Number.ToText(GetDepth) & "." & ParentKeyColumn,

    JoinTables = Table.Join(FromTable,{GetKeyColumn},

        Table.PrefixColumns(ToTable , Number.ToText(GetDepth)),

            {GetParentKeyColumn}, JoinKind.LeftOuter),

    FinalResult = if

        List.MatchesAll(Table.Column(JoinTables, GetParentKeyColumn), each _=null)

        then FromTable

        else RecursiveJoin(JoinTables, KeyColumn, ParentKeyColumn, ToTable, GetDepth+1)

in

    FinalResult

in

    Source

A few interesting things to point out:

  • I’ve used a LET statement inside my function declaration, so I can have multiple statements inside it
  • I’ve used Table.Join to do the left outer join between the two tables I’m expecting
  • The parameters I’m using are:
    • FromTable – the table on the left hand side of the join. When the function is first called, this should be a table that contains the Employees who have no parents (ie where ParentEmployeeKey is null); when the function calls itself, this will be the result of the join.
    • ToTable – the table on the right hand side of the join. This is always a table that contains the Employees who do have parents.
    • KeyColumn – the name of the Employee’s key column
    • ParentKeyColumn – the name of the Employee’s parent key column
  • I’ve used Table.PrefixColumn to rename all the columns in the table on the right hand side of the join, prefixing them with the depth of the call stack, so I get distinct column names.
  • The function calls itself until it finds it has done a join where the last ParentKeyColumn contains only null values. I’ve used List.MatchesAll to check this.

Here’s the call to this function – you only need to include one step in the Data Explorer query to do this – to return the flattened structure:

= RecursiveJoin(

    Table.SelectRows(Employees, each [ParentEmployeeKey]=null),

    "EmployeeKey",

    "ParentEmployeeKey",

    Table.SelectRows(Employees, each [ParentEmployeeKey]<>null)

    )

And here’s the output:

image

In this case the output isn’t exactly the same as what BIDS Helper might produce, because BIDS Helper has some special requirements for SSAS user hierarchies. Also, since I’m still learning Data Explorer and M, I’m not sure my code in the most efficient, elegant way. But I still think it’s an interesting example and I hope it’s useful to other Data Explorer enthusiasts out there – we’re a small but growing band!

You can download my demo workbook here.

Written by Chris Webb

June 22, 2013 at 3:13 pm

A New Events-In-Progress DAX Pattern

with 25 comments

I’ve been working on a very complex SSAS Tabular implementation recently, and as a result I’ve learned a few new DAX tricks. The one that I’m going to blog about today takes me back to my old favourite, the events-in-progress problem. I’ve blogged about it a lot of times, looking at solutions for MDX and DAX (see here and here), and for this project I had to do some performance tuning on a measure that uses a filter very much like this.

Using the Adventure Works Tabular model, the obvious way of finding the number of Orders on the Internet Sales table that are open on any given date (ie where the Date is between the dates given in the Order Date and the Ship Date column) is to write a query something like this:

EVALUATE

ADDCOLUMNS (

    VALUES ( 'Date'[Date] ),

    "OpenOrders",

    CALCULATE (

        COUNTROWS ( 'Internet Sales' ),

        FILTER( 'Internet Sales', 'Internet Sales'[Ship Date] > 'Date'[Date] ),

        FILTER( 'Internet Sales', 'Internet Sales'[Order Date] <= 'Date'[Date] )

    )

)

ORDER BY 'Date'[Date]

On my laptop this executes in around 1.9 seconds on a cold cache. However, after a bit of experimentation, I found the following query was substantially faster:

EVALUATE

ADDCOLUMNS (

    VALUES ( 'Date'[Date] ),

    "OpenOrders",

    COUNTROWS(

        FILTER(

            'Internet Sales',

            CONTAINS(

                DATESBETWEEN('Date'[Date]

                    , 'Internet Sales'[Order Date]

                    , DATEADD('Internet Sales'[Ship Date],-1, DAY))

                , [Date]

                , 'Date'[Date]

            )

        )

    )

)

ORDER BY 'Date'[Date]

On a cold cache this version executes in just 0.2 seconds on my laptop. What’s different? In the first version of the calculation the FILTER() function is used to find the rows in Internet Sales where the Order Date is less than or equal to the Date on rows, and where the Ship Date is greater than the Date. This is the obvious way of solving the problem. In the new calculation the DATESBETWEEN() function is used to create a table of dates from the Order Date to the day before the Ship Date for each row on Internet Sales, and the CONTAINS() function is used to see if the Date we’re interested in appears in that table.

I’ll be honest and admit that I’m not sure why this version is so much faster, but if (as it seems) this is a generally applicable pattern then I think this is a very interesting discovery.

Thanks to Marco, Alberto and Marius for the discussion around this issue…

UPDATE: Scott Reachard has some some further testing on this technique, and found that the performance is linked to the size of the date ranges. So, the shorter your date ranges, the faster the performance; if you have large date ranges, this may not be the best performing solution. See https://twitter.com/swreachard/status/349881355900952576

UPDATE: Alberto has done a lot more research into this problem, and come up with an even faster solution. See: http://www.sqlbi.com/articles/understanding-dax-query-plans/

Written by Chris Webb

June 13, 2013 at 10:32 am

BI Survey 13

leave a comment »

Just a quick post to say that the BI Survey 13 is now open and would like your feedback about the BI tools you’re using. Here’s the link:
https://digiumenterprise.com/answer?link=1358-CTDDUWQN

It’s worth doing because you’ll get a summary of the results and will get entered in a draw for some Amazon vouchers; and of course you get to make sure that lots of nice things are said about the Microsoft BI stack! I’ll blog about the results later in the year when they come out – the findings are always very interesting.

Written by Chris Webb

June 8, 2013 at 11:40 pm

Posted in Uncategorized

Analysis Services Multidimensional Now Works With Power View–And Why That’s Important

with 22 comments

By now you may have already heard the news that, as part of SQL Server 2012 SP1 CU4, new functionality has been released that means that Power View now works with Analysis Services Multidimensional (ie cubes, as opposed to the Tabular Model, which always worked with Power View). I won’t bother to repeat the technical details which you can read about here:
http://blogs.msdn.com/b/analysisservices/archive/2013/05/31/power-view-connectivity-for-multidimensional-models-released.aspx

…but the main points are that Analysis Services Multidimensional can now be queried in DAX, and this plus some tweaks to Power View mean that the two can be used together for the first time. Unfortunately Power View in Excel 2013 doesn’t work with Analysis Services Multidimensional yet, but I hope that will also be fixed very soon.

I’ve been playing with the public CTP of this for a while and done a few presentations with it, and from a technical point of view it’s a solid bit of work by the Analysis Services dev team. It just works, and while there are a few limitations they’re trivial. Arguably it should not have been necessary to do it in the first place – why didn’t Power View speak MDX when it was built, which would have meant it could have queried both Tabular and Multidimensional? But it’s here now, and that’s what counts. It also opens up some interesting possibilities for using DAX queries to create detail-level reports on cubes, and also for defining DAX calculations inside those queries.

However I think its real importance is strategic. This is the first significant bit of new functionality in Analysis Services Multidimensional for a long while, and it acts as a bridge between the classic SQL Server BI stack that most of us are using and the brave new world of Office/Sharepoint-led BI. It is also the first time in a long time that Analysis Services Multidimensional users have had a dedicated client tool for data analysis from Microsoft that isn’t Excel. Don’t get me wrong, I love Excel as a client tool for SSAS but I’ve always thought (and I think industry trends over the last few years support this view) that even though Excel is a great way to bring data analysis to the masses, there’s still an important niche among power users for a more advanced data analysis and data visualisation tool.

You may be thinking at this point that pretty graphs and charts are all very well, but your users don’t need anything other than the SSRS reports and basic PivotTables that they’ve been using for the last few years. I say that you ignore Power View at your own risk. Microsoft’s competitors in the BI space are hungry for new customers and are interested in migration projects. You might well arrive at the office next Monday morning to find that there’s a new CFO who used QlikView in his last job, and who wants the same pretty graphs and charts he had there again. It’s not going to be any use arguing that you’ve spent years developing this cube, that it’s lightning fast and has all sorts of tricky business logic coded into thousands of lines of MDX – if your BI solution’s user interface looks and feels dated, then whatever its technical merits it will have the musty smell of legacy software about it. If, however, you can fire up a VM with Sharepoint 2013 and Power View on and show off some slick dashboards created from your existing cubes, even if this is something the majority of your end users wouldn’t really be interested in (and you may be wrong, they might love it), you’re going to be showing the business two important things:

  • Microsoft can do sexy dashboards and visualizations too, and while they come at a price, that price is probably a lot less than it would cost to rip and replace what you’ve got with a competitor’s software. So the option’s there if you want to spend the money and do the upgrade.
  • Analysis Services cubes are not a dead-end, and Microsoft has made a significant investment here to prove this. I’d still love to see a coherent roadmap that explains where Microsoft is heading with its BI tools and how it expects its existing customers to get there, but I doubt we’ll get one. This functionality was, however, delivered in response to popular demand, so I’m hopeful that if we as customers can make our voices heard as to what we want in the future then we can influence Microsoft’s direction.

So go forth and Power View. Both Rob Kerr and Koen Verbeeck have recently published some excellent, detailed guides to setting up a Sharepoint 2013 demo environment; you have no excuse for not testing this out and being ready to face the competition.

Written by Chris Webb

June 2, 2013 at 9:25 pm

Follow

Get every new post delivered to your Inbox.

Join 3,112 other followers