Chris Webb's BI Blog

Analysis Services, MDX, PowerPivot, DAX and anything BI-related

Simple Basket Analysis in DAX

with 2 comments

Continuing the theme of finding the distinct number of customers who’ve done something specific, I’ve recently been having a go at implementing simple basket analysis in DAX. So, for example, as well as finding the distinct number of customers who bought Apples, we may also want to find the distinct number of customers who bought both Apples and Oranges, or even those who bought Apples and Oranges but not Cake.

To illustrate this problem I used the following sample data entered into Excel tables as the basis for a PowerPivot model:



So what we have here is essentially a Product dimension table, a Customer dimension table and a Sales fact table. I did the obvious thing and set up relationships going from Sales to Product and Sales to Customer, but I also created two extra tables in PowerPivot that were built from the Product table: one to allow the selection of Products for the AND filter, and one to allow for the selection of Products in the NOT filter, neither of which were related to any other table. So when building the query to show the number of customers that bought Apples and Oranges but not Cake, the user would select Apples on the Product table, Oranges on the AND Product table and Cake on the NOT Product table.

The best and fastest way I found of writing the DAX I needed was to break the problem up into several smaller calculated measures. The starting point, to get the number of customers that bought Apples, is a simple calculated measure like this:

=COUNTROWS(DISTINCT(Sales[Customer Key]))

I then created two calculated measures: one that returned the distinct number of customers who had bought the AND Product, ignoring the selections made on Product and NOT Product; and one that returned the number of customers who had bought the NOT Product, ignoring the selections made on Product and AND Product. Here they are:

=CALCULATE(
COUNTROWS(DISTINCT(Sales[Customer Key]))
, FILTER(ALL(Product),
COUNTROWS(
FILTER(DISTINCT(‘Product Bought With’[And Product Key]), Product[Product Key] = ‘Product Bought With’[And Product Key]))
>0))

=CALCULATE(
COUNTROWS(DISTINCT(Sales[Customer Key]))
, FILTER(ALL(Product),
COUNTROWS(
FILTER(DISTINCT(‘Product Not Bought’[Not Product Key]), Product[Product Key] = ‘Product Not Bought’[Not Product Key]))
>0))

They both work in the same way: they find the distinct count but shift the context on the Product table (which has a relationship with Sales remember, whereas AND Product and NOT Product don’t) so that whatever has been selected on Product is ignored and, the selections made on AND Product and NOT Product are, respectively, used instead. To do this I’m using two nested filters – one which loops over every row in the Product table, and then an inner filter that loops over every selected member on AND/NOT Product; they return the rows from Product where the Product Key appears in the list of distinct Product Keys selected on AND/NOT Product.

At this point we can build a pivot table that looks like this:

With Customers on rows, obviously, no distinct count is greater than 1, but we can now see that Chris, Helen and Mimi bought Apples; that Chris and Natasha bought Oranges; and that Mimi and Natasha bought Cake. We can now use these measures to create our calculated measure that returns the distinct number of customers that bought Apples and Oranges but did not buy Cake:

=CALCULATE(COUNTROWS(DISTINCT(Sales[Customer Key])), FILTER(DISTINCT(Sales[Customer Key]), [Customers Buying Not Product]=0 && [Customers Buying And Product]>0))

It’s a variation on the original distinct count measure, but now we’re using Calculate once again to shift the context so we take the distinct customers and apply an additional filter to ensure they didn’t buy whatever was selected on NOT Product but did buy whatever was selected on AND Product.

At last we can see that Chris was the only customer that bought Apples and Oranges but did not buy Cake. We can also run queries like this that don’t display individual customers:

There’s one last problem to solve: what happens if we don’t want to apply the AND or NOT filter for some queries? It doesn’t really make much sense to select the All member on either of these tables (for example, you’d never want to find the number of Customers that bought Apples and bought any Product and did not buy any Product) so we can alter the calculations to turn off the AND and NOT filter when the All member has been selected on either the AND or NOT Product table. Here are the new versions of the two calculated measures that return the distinct counts for the AND and NOT products:

=IF(
COUNTROWS(DISTINCT(‘Product Bought With’[And Product Key]))=COUNTROWS(ALL(‘Product Bought With’[And Product Key]))
, 1
, CALCULATE(
COUNTROWS(DISTINCT(Sales[Customer Key])),
FILTER(ALL(Product),
COUNTROWS(
FILTER(DISTINCT(‘Product Bought With’[And Product Key]), Product[Product Key] = ‘Product Bought With’[And Product Key]))
>0)))

=IF(
COUNTROWS(DISTINCT(‘Product Not Bought’[Not Product Key]))=COUNTROWS(ALL(‘Product Not Bought’[Not Product Key]))
, 0
, CALCULATE(
COUNTROWS(DISTINCT(Sales[Customer Key])),
FILTER(ALL(Product),
COUNTROWS(
FILTER(DISTINCT(‘Product Not Bought’[Not Product Key]), Product[Product Key] = ‘Product Not Bought’[Not Product Key]))
>0)))

Here I’ve wrapped the previous expressions in an IF that checks if the number of distinct Product Keys in the selection matches the number of distinct Product Keys in the entire table – if it does then the All member has been selected, and the first expression then returns a 1 and the second returns a 0 to ensure that the filter conditions on these measures in the final calculated measure return true when the All member is selected. We can then run simpler queries like this one showing the customers who bought both Apples but not Cake, with no AND filter applied:

As always I’d be interested in hearing from anyone that’s got a better way of solving these problems. I’m aware I’ve used the evil Filter function all over the place and that that isn’t always good for performance (although on larger data sets this approach seemed to perform ok) so I wonder if there’s a way to avoid its use anywhere?

Written by Chris Webb

October 20, 2010 at 10:47 pm

Posted in DAX

2 Responses

Subscribe to comments with RSS.

  1. Chris, you know, basket analysis model is just like survey model and I have a long story with that! :)
    My approach is “model first” and by duplicating the Sales table in PowerPivot I defined this “Distinct Customers with Not and And Applied” measure:
    = CALCULATE( COUNTROWS( Customer ),
    FILTER( Customer, CALCULATE( COUNTROWS( Sales ) ) > 0 ),
    FILTER( Customer, CALCULATE( COUNTROWS( SalesAnd ) ) > 0 ),
    FILTER( Customer, CALCULATE( COUNTROWS( SalesNot ) ) = 0 ) )

    I don’t know if it is faster (it might be) but is much more readable (at least to me!) – and no special cases for All selection are needed.

    The complete workbook is available here:

    http://cid-72aabb34d241c821.office.live.com/view.aspx/PowerPivot/Basket%20Analysis.xlsx

    Feedbacks are welcome!

    Marco Russo

    October 23, 2010 at 11:04 am

  2. [...] taking the PowerPivot model and calculated measures from my last post, I started to look at how to import that data into the NodeXL Excel template. The first problem was [...]


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 2,867 other followers

%d bloggers like this: