Chris Webb's BI Blog

Analysis Services, MDX, PowerPivot, DAX and anything BI-related

Archive for November 2013

Pivoting Data In Power Query

with 2 comments

One of the features I’ve loved right from the start in Power Query is the ability to UnPivot data right from the UI (I’m not alone – Ken Puls has written a good blog post summing up this functionality and is equally impressed). However earlier this year when I was trying to implement Life in Power Query, I found that I needed to do the opposite: I needed to pivot data. I was able to do this with some pretty complex M, but since then the nice people on the Power query team have added a new Table.Pivot() function to allow this to be done easily. Unfortunately pivoting is not something that can be done using the UI alone (at least not at the time of writing), but it’s pretty straightforward to write your own M expression to use this new function.

Let’s take the following table as a starting point:

image

You can see, in the Measure column, we have values Sales Units and Sales Value that tell us what the numbers in each row in the Value column represent. For most purposes, it’s better to pivot this data so that we have two columns, one for Sales Units and one for Sales Value. You can do this pivot operation in a single step using Table.Pivot(). Here’s an example script that loads the data from the table above and pivots it:

let

    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],

    Pivot = Table.Pivot(Source, {"Sales Units", "Sales Value"}, "Measure", "Value")

in

    Pivot

Here’s the output:

image

It’s worth pointing out that if you don’t include each distinct value from your chosen column in the second parameter, those rows will be lost after the pivot takes place. So

= Table.Pivot(Source, {"Sales Units"}, "Measure", "Value")

Returns just:

 
image

 

Listing out all the column names in that second parameter is a bit of a pain, so we can get a list with all the distinct values in that column and improve our expression as follows:

= Table.Pivot(Source, List.Distinct(Table.Column(Source, "Measure")), "Measure", "Value")

This returns the same result as our first query – it uses Table.Column() to get a list of all the values in the Measure column, then List.Distinct() to return only the distinct values from that list.

Finally, there is an optional fifth parameter that can be used to aggregate data. The following source table has two rows for Sales Value for March:

image

If I use my original expression the pivot will work but I’ll get an error value in the cell for Sales Value for March:

image

Specifying a function to aggregate the data in these scenarios, where two cells need to be pivoted to one, solves the problem:

= Table.Pivot(Source, {"Sales Units", "Sales Value"}, "Measure", "Value", List.Sum)

image

In this case I’m passing the function List.Sum() to get the value 10+15=25 in the cell for Sales Value for March.

Written by Chris Webb

November 25, 2013 at 9:30 am

Posted in Power Query

Generating A Date Dimension Table In Power Query

with 28 comments

There are hundreds of different methods for generating a Date dimension table for use in Power Pivot: you can use TSQL, Excel formulas, import from the DateStream table in the Windows Azure Marketplace, and there’s even an app for it. Nevertheless I thought it would be a good idea to solve this problem in Power Query because:

  • It seems like the natural tool to use, and a date table is one of the first tables a data steward will want to share with his/her users.
  • It’s a very good way to learn about and demonstrate all the date functionality in M and Power Query. The more M code examples that there are out there, the better, right?
  • Having just signed the contract to write a whole book on Power Query I need to master M very quickly!

Here’s how I did it. First of all, I decided to implement a function to create the table rather than a regular query; this function takes two parameters: the start date and the end date of the range of dates to appear in the table. Here’s the function I came up with:

let

    CreateDateTable = (StartDate, EndDate) =>

let

    /*StartDate=#date(2012,1,1),

    EndDate=#date(2013,12,31),*/

    //Create lists of month and day names for use later on

    MonthList = {"January", "February", "March", "April", "May", "June"

                 , "July", "August", "September", "October", "November", "December"},

    DayList = {"Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"},

    //Find the number of days between the end date and the start date

    NumberOfDates = Duration.Days(EndDate-StartDate),

    //Generate a continuous list of dates from the start date to the end date

    DateList = List.Dates(StartDate, NumberOfDates, #duration(1, 0, 0, 0)),

    //Turn this list into a table

    TableFromList = Table.FromList(DateList, Splitter.SplitByNothing(), {"Date"}

                     , null, ExtraValues.Error),

    //Caste the single column in the table to type date

    ChangedType = Table.TransformColumnTypes(TableFromList,{{"Date", type date}}),

    //Add custom columns for day of month, month number, year

    DayOfMonth = Table.AddColumn(ChangedType, "DayOfMonth", each Date.Day([Date])),

    MonthNumber = Table.AddColumn(DayOfMonth, "MonthNumberOfYear", each Date.Month([Date])),

    Year = Table.AddColumn(MonthNumber, "Year", each Date.Year([Date])),

    DayOfWeekNumber = Table.AddColumn(Year, "DayOfWeekNumber", each Date.DayOfWeek([Date])+1),

    //Since Power Query doesn't have functions to return day or month names, 

    //use the lists created earlier for this

    MonthName = Table.AddColumn(DayOfWeekNumber, "MonthName", each MonthList{[MonthNumberOfYear]-1}),

    DayName = Table.AddColumn(MonthName, "DayName", each DayList{[DayOfWeekNumber]-1}),

    //Add a column that returns true if the date on rows is the current date

    IsToday = Table.AddColumn(DayName, "IsToday", each Date.IsInCurrentDay([Date]))

in

    IsToday

in

    CreateDateTable

 

Some things to point out:

  • Look – comments! Both multi-line /* */ and single line //. Adding comments to complex Power Query queries and functions is of course a Good Thing and something we should all be doing.
  • The way I created the table from the start and end date is quite interesting: the List.Dates() function allows you to generate a list of dates from one date to another; I was then able to take this list and turn it into a table with Table.FromList().
  • Adding extra columns to the table for the year and month number and so on is fairly straightforward. However, as far as I can see, Power Query does not have the ability to return month or day names (as opposed to numbers) from a date, so to get round this I created two lists called MonthList and DayList containing the months of the year and the days of the week, in order, took the day and month numbers and used these to find the name at the nth position in the list (see the MonthName and DayName steps).
  • Date.IsInCurrentDay() compares a date with the current system date and allows you to see if a date is today’s date. There are a whole bunch of other interesting functions like this such as Date.IsInPreviousWeek(), Date.IsInCurrentYear() and so on, useful for creating relative date dimension tables like the one I describe here.

With the function created it’s very easy to generate a table using it. For example, I created a parameter table with start date and end date columns like so:

image

I then used this table as the starting point for a new query that called my function:

let

    Source = Excel.CurrentWorkbook(){[Name="Table3"]}[Content],

    ChangedType = Table.TransformColumnTypes(Source,{{"Start Date", type date}

                  , {"End Date", type date}}),

    CallCreateDateTable = CreateDateTable(ChangedType[Start Date]{0}

                          ,ChangedType[End Date]{0})

in

    CallCreateDateTable

Here’s the output:

image

You can download my demo workbook with all the code in from here.

PS If you haven’t downloaded the November update for Power Query already, I recommend you do so – the UI has been improved a lot.

Written by Chris Webb

November 19, 2013 at 5:32 pm

Posted in Power BI, Power Query

Now(), The Formula Engine Cache And The Where Clause

leave a comment »

Back in 2009 I blogged about how the use of the Now() function inside calculated members prevents the results of those calculations being cached for longer than the lifetime of a query. It might be worth rereading that post before you carry on to get some background on the problem here:
http://cwebbbi.wordpress.com/2009/09/10/now-and-the-formula-cache/

Today I had an interesting email from a customer (thank you, Per!) showing me something I didn’t know: that using the Now() function in the Where clause of a query has the same effect.

Here’s the repro. On the MDX Script of the Adventure Works cube, I created the following calculated measure:

CREATE MEMBER CURRENTCUBE.MEASURES.TEST AS [Measures].[Internet Sales Amount] + 1;

Now consider the following query:

SELECT {MEASURES.TEST} ON 0,

{[Date].[Day Name].&[1]}

ON 1

FROM [Adventure Works]

WHERE(

STRTOMEMBER("[Date].[Calendar Year].&[" + "2003"  +  "]")

)

 

The first time the query is run you can see in Profiler the SSAS Storage Engine retrieving values; the second time it’s run you can see the values for MEASURES.TEST being returned from the Formula Engine cache.

Now consider this second query:

SELECT {MEASURES.TEST} ON 0,

{[Date].[Day Name].&[1]}

ON 1

FROM [Adventure Works]

WHERE(

STRTOMEMBER("[Date].[Calendar Year].&[" + cstr(Year(Now())-10)  +  "]")

)

 

It returns the same result as the previous query (at least so long as the current year is 2013), but you can see in Profiler that the second time the query is run, once again the Storage Engine cache is queried and the Formula Engine cache is not used:

image

So clearly the use of the Now() function in the Where clause is enough to prevent the use of global scope by the Formula Engine cache, and some limited testing suggests the same applies for other non-deterministic functions like Username() – which is hardly surprising. Something to watch out for, then…

Written by Chris Webb

November 10, 2013 at 10:41 pm

Posted in Uncategorized

Caching The Rows Returned By An MDX Query

with 2 comments

Here’s another tip for those of you struggling with the performance of SSRS reports that run on top of an Analysis Services Multidimensional cube. Quite often, SSRS reports require quite complex set expressions to be used on the rows axis of an MDX query, and one of the weaknesses of SSAS is that while it can (usually) cache the values of cells returned by a query it can’t cache the structure of the cellset returned by the query. What does this mean exactly? Well, consider the following query:

SELECT

{[Measures].[Internet Sales Amount]} ON 0,

NONEMPTY(

GENERATE(

[Date].[Calendar].[Month].MEMBERS,

{[Date].[Calendar].CURRENTMEMBER}

*

HEAD(ORDER([Customer].[Customer].[Customer].MEMBERS,

[Measures].[Internet Sales Amount],

BDESC),2)

),

[Measures].[Internet Sales Amount])

ON 1

FROM [Adventure Works]

WHERE([Product].[Category].&[3])

Here I’m taking every month on the Calendar hierarchy of the Date dimension and finding the top two customers by Internet Sales Amount for each Month; notice also that I’m slicing the query by a Product Category. The results look like this:

image

On my laptop this query takes just over three seconds to run however many times you run it (and yes, I know there are other ways this query can be optimised, but let’s imagine this is a query that can’t be optimised). The reason it is consistently slow is because the vast majority of the time taken for the query is to evaluate the set used on rows – even when the Storage Engine has cached the values for Internet Sales Amount for all combinations of month and customer, it still takes the Formula Engine a long time to find the top two customers for each month. Unfortunately, once the set of rows has been found it is discarded, and the next time the query is run it has to be re-evaluated.

How can we improve this? SSAS can’t cache the results of a set used on an axis in a query, but SSAS can cache the result of a calculated measure and calculated measures can return strings, and these strings can contain representations of sets. Therefore, if you go into Visual Studio and add the following calculated measure onto the MDX Script of the cube on the Calculations tab of the Cube Editor:

CREATE MEMBER CURRENTCUBE.MEASURES.REPORTROWS AS

SETTOSTR(

NONEMPTY(

GENERATE(

[Date].[Calendar].[Month].MEMBERS,

{[Date].[Calendar].CURRENTMEMBER}

*

HEAD(ORDER([Customer].[Customer].[Customer].MEMBERS,

[Measures].[Internet Sales Amount],

BDESC),2)

),

[Measures].[Internet Sales Amount])

);

You can then use this calculated measure in your query as follows:

SELECT

{[Measures].[Internet Sales Amount]} ON 0,

STRTOSET(MEASURES.REPORTROWS)

ON 1

FROM [Adventure Works]

WHERE([Product].[Category].&[3])

Having done this, on my laptop the query is just as slow as before the first time it is run but on subsequent executions it returns almost instantly. This is because the first time the query is run the set expression used on rows is evaluated inside the calculated measure ReportRows and it is then turned into a string using the SetToStr() function; this string is then returned on the rows axis of the query and converted back to a set using the StrToSet() function. The second time the query is run the string returned by the ReportRows measure has already been cached by the Formula Engine, which explains why it is so fast.

Couldn’t I have used a static named set declared on the cube to do this instead? I could, if I knew that the Where clause of the query would never change, but if I wanted to change the slice and look at a different Product Category I would expect to see a different set of rows displayed. While in theory I could create one gigantic named set containing every set of rows that ever might need to be displayed and then display the appropriate subset based on what’s present in the Where clause, this set could take a very long time to evaluate and thus cause performance problems elsewhere. The beauty of the calculated measure approach is that if you change the Where clause the calculated measure will cache a new result for the new context.

There are some things to watch out for if you use this technique, however:

  • It relies on Formula Engine caching to work. That’s why I declared the calculated measure on the cube – it won’t work if the calculated measure is declared in the WITH clause. There are a lot of other things that you can do that will prevent the Formula Engine cache from working too, such as declaring any other calculated members in the WITH clause, using subselects in your query (unless you have SSAS 2012 SP1 CU4), using non-deterministic functions and so on.
  • Remember also that users who are members of different roles can’t share formula engine caches, so if you have a lot of roles then the effectiveness of this technique will be reduced.
  • There is a limit to the size of strings that SSAS calculated measures can return, and you may hit that limit if your set is large. In my opinion an SSRS report should never return more than a few hundred rows at most for the sake of usability, but I know that in the real world customers do love to run gigantic reports…
  • There is also a limit to the size of the Formula Engine flat cache (the cache that is being used here), which is 10% of the TotalMemoryLimit. I guess it is possible that if you run a lot of different queries you could hit this limit, and if you do then the flat cache is completely emptied.

Written by Chris Webb

November 4, 2013 at 9:30 am

Follow

Get every new post delivered to your Inbox.

Join 2,868 other followers