Chris Webb's BI Blog

Analysis Services, MDX, PowerPivot, DAX and anything BI-related

Archive for March 2013

Creating And Using Functions in Data Explorer (Power Query)


NOTE: This post was written before Data Explorer was renamed as Power Query. All of the content is still relevant to Power Query.

I’ve been reading through the Data Explorer Formula Language and Library specifications (which can be downloaded here) over the last few weeks, learning about what’s possible with it when you go beyond the UI. This will translate into a lot of Data Explorer blog posts in the coming weeks…! Here’s the first of those posts.

Today I’m going to look at how to create and use functions. Page 86 onwards in the Formula Language specification has a lot of detail about how to define a function in the M language that Data Explorer uses, but unfortunately it doesn’t include any practical examples of how functions can be created and used in Data Explorer and Excel. I’ve done some experimentation, though, and this post aims to fill that gap.

First of all, start with an Excel worksheet containing two tables called InputNumbers1 and InputNumbers2, each with two numeric columns called FirstNumber and SecondNumber.


Let’s say that you want to create two Data Explorer queries, each using one of these tables as its source, and in both cases you want to add a third column that displays the product of the two existing columns. So, for example, for the first table the new column would contain the values 2, 12, 30 and 56.

Now this is a very simple calculation and certainly one that Data Explorer can handle easily, but let’s say that you want to perform this operation many times in many different queries and for obvious reasons you don’t want to have to duplicate the same logic in each query. You want to create a function!

To do this, you need to use the Write Query option from the From Other Sources button on the Data Explorer tab in the ribbon.


Then change the name of the new query to MultiplicationFunction by double-clicking on the query name in large letters at the top of the Edit Query dialog.


Then enter the following code as the only step:

= (x,y) => x * y

This defines a function that takes two parameters, x and y, and returns their product. Click Done to save the query, and a small placeholder table will appear in the worksheet.


You can ignore this new table and disable the data load to the worksheet, but it seems like the table always has to be there – you certainly can't delete it.
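As an aside, M also lets you add type annotations to function parameters and return values. Here's a minimal sketch of the same function with types added – this is optional, and just an illustration of what the language allows rather than something the UI will generate for you:

= (x as number, y as number) as number => x * y

Defined this way, calling the function with anything other than two numbers raises an error immediately instead of producing a confusing result further downstream.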

To use this new function, create a new query from the InputNumbers1 table using the From Table button. Then, in the Edit Query dialog, create a new step by selecting Insert Column/Custom.


Then use the new function to calculate each row in the new column, passing the two existing columns as parameters.


This results in the following step:

= Table.AddColumn(Source, "Custom", each MultiplicationFunction([FirstNumber],[SecondNumber]))


And so you have your new column showing the product of the [FirstNumber] and [SecondNumber] columns. Do the same thing for the InputNumbers2 table and you'll have two Data Explorer queries, each with a custom column.
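For reference, the complete query you end up with at this point, written out by hand, looks something like the following. This is a sketch rather than the exact code the UI generates – the step name InsertedCustom is my own:

let
    // the source table comes from the current workbook
    Source = Excel.CurrentWorkbook(){[Name="InputNumbers1"]}[Content],
    // call the shared function once per row to build the new column
    InsertedCustom = Table.AddColumn(Source, "Custom", each MultiplicationFunction([FirstNumber], [SecondNumber]))
in
    InsertedCustom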


Finally, to make sure that the function is indeed being used, go and alter the definition of the function to be

= (x,y) => x / y

Refresh both of the other queries, and you'll see the contents of the custom column have changed for both of them.


I can imagine that the ability to break functionality out into functions will reduce the complexity, and improve the maintainability, of many ETL solutions that get built with Data Explorer in the future.

You can download my sample Excel workbook for this demo here.

Written by Chris Webb

March 25, 2013 at 5:22 pm

UK/US Date Format Bug in PowerPivot and SSAS Tabular


I don’t usually blog about bugs, but this one has been irritating me no end for the last year – so I thought it deserved some publicity…

In Excel 2010 PowerPivot and in SSAS 2012 Tabular models (but, interestingly, not the Excel 2013 Data Model), if you have an English locale that is not US English (e.g. UK or Australian English), you may find that date columns appear to be formatted correctly as dd/mm/yyyy inside the PowerPivot window or in SSDT, but when you get to Excel you see the dates formatted in the US mm/dd/yyyy format. So, for example, on my laptop, if I import the DimDate table from Adventure Works into Excel 2010 I see dates formatted as dd/mm/yyyy, as I've specified in the Formatting section of the ribbon in the PowerPivot window.


However, in an Excel PivotTable, I see dates formatted as mm/dd/yyyy.


There is a workaround though, which I found on the PowerPivot forum (thank you Steve Johnson, if you're reading) – you can get the dates to format correctly if you go to More Date Formats and choose dd/MM/yy or one of the other formats from the dropdown list that appears.


The dates then show up correctly formatted in the PivotTable.


It seems there is already a Connect item open on this issue here, so please vote to get it fixed!

Written by Chris Webb

March 21, 2013 at 9:30 am

Finding Shakespeare’s Favourite Words With Data Explorer (Power Query)


NOTE: This post was written before Data Explorer was renamed as Power Query. All of the content is still relevant to Power Query.

The more I play with Data Explorer, the more I think my initial assessment of it as a self-service ETL tool was wrong. As Jamie pointed out recently, it’s really the M language with a GUI on top of it and the GUI itself, while good, doesn’t begin to expose the power of the underlying language: I’d urge you to take a look at the Formula Language Specification and Library Specification documents which can be downloaded from here to see for yourself. So while it can certainly be used for self-service ETL it can do much, much more than that…

In this post I’ll show you an example of what Data Explorer can do once you go beyond the UI. Starting off with a text file containing the complete works of William Shakespeare (which can be downloaded from here – it’s strange to think that it’s just a 5.3 MB text file) I’m going to find the top 100 most frequently used words and display them in a table in Excel.

Before I do that, though, some things to point out. First, there's a new update of Data Explorer that appeared a few days ago – the Data Explorer team blog has the details. One of the new bits of functionality is a button that allows you to edit all of the expressions in your query at once.


Second, when you're building a query and want to add a new step manually – one that refers to the previous step but doesn't apply any calculations or transforms – you need to click on the fx button next to the expression on an existing step.


I’ve used this quite extensively to write custom steps that aren’t possible with the UI.

Here is the full code for my query to find the top 100 words:

let
    Source = Lines.FromBinary(File.Contents("C:\Users\Chris\Documents\CompleteShakespeare.txt")),
    RemoveLineBreaks = Lines.ToText(Source, " "),
    RemovePunctuation = Text.Remove(RemoveLineBreaks,
        {",", ".", "?", ";", ":", "'", "@", "#", "~", "{", "[", "}", "]", "(", ")", "*"}),
    Lowercase = Text.Lower(RemovePunctuation),
    TurnIntoTable = Table.FromValue(Lowercase),
    SplitIntoWords = Table.TransformColumns(TurnIntoTable, {"Value", Splitter.SplitTextByWhitespace()}),
    ListOfWords = SplitIntoWords{0}[Value],
    TableFromList = Table.FromList(ListOfWords, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    RenameColumnToWord = Table.RenameColumns(TableFromList, {{"Column1", "Word"}}),
    RemoveBlanks = Table.SelectRows(RenameColumnToWord, each ([Word] <> "")),
    FindWordCounts = Table.Group(RemoveBlanks, {"Word"}, {{"Count", each Table.RowCount(_), type number}}),
    SortedRows = Table.Sort(FindWordCounts, {{"Count", Order.Descending}}),
    KeptFirstRows = Table.FirstN(SortedRows, 100)
in
    KeptFirstRows

Broken down step-by-step:

Source = Lines.FromBinary(File.Contents("C:\Users\Chris\Documents\CompleteShakespeare.txt"))

This loads the contents of CompleteShakespeare.txt into Data Explorer and breaks it up into lines of text.


RemoveLineBreaks = Lines.ToText(Source, " ")

This removes all the line breaks and creates a single chunk of text.


RemovePunctuation = Text.Remove(RemoveLineBreaks, {",", ".", "?", ";", ":", "'", "@", "#", "~", "{", "[", "}", "]", "(", ")", "*"})

Removes any punctuation, and

Lowercase = Text.Lower(RemovePunctuation)

Turns all the words to lowercase.


TurnIntoTable = Table.FromValue(Lowercase)

Takes the text and creates a table with one column and one row containing the text.


SplitIntoWords = Table.TransformColumns(TurnIntoTable, {"Value", Splitter.SplitTextByWhitespace()})

Takes the text and creates a List object with one entry for every word in the text.

ListOfWords = SplitIntoWords{0}[Value]

TableFromList = Table.FromList(ListOfWords, Splitter.SplitByNothing(), null, null, ExtraValues.Error)

RenameColumnToWord = Table.RenameColumns(TableFromList, {{"Column1", "Word"}})

Takes the list and turns it into a table with one column and one row for each word.


RemoveBlanks = Table.SelectRows(RenameColumnToWord, each ([Word] <> ""))

Removes any rows containing blanks.

FindWordCounts = Table.Group(RemoveBlanks, {"Word"}, {{"Count", each Table.RowCount(_), type number}})

Does a ‘group by’ to return a table with one row for each distinct word, and a column containing the number of occurrences of that word.


SortedRows = Table.Sort(FindWordCounts, {{"Count", Order.Descending}})

Sorts this table in descending order by word count.

KeptFirstRows = Table.FirstN(SortedRows,100)

Only returns the first 100 rows from this table.

And at this point, we have the top 100 words used by Shakespeare (admittedly including the text at the beginning and end of the file inserted by Project Gutenberg that I couldn't be bothered to remove).


No surprises here really – ‘the’ comes in at #1 with 18444 occurrences, “lord” comes in at #45, “king” at #69 and so on.

While you probably aren’t interested in Shakespeare for your own business purposes, it shows what would be possible if you wanted to analyse what was being said in Facebook updates, Tweets, product reviews and so on. It wasn’t all that hard to write either, especially considering that I’m fairly new to Data Explorer, so more sophisticated variations on this would certainly be possible without much extra effort. And it’s worth pointing out that it’s fast too: it takes a couple of seconds to refresh, and I’m not even sure I’ve done everything in the most efficient way possible.

You can download the sample workbook (but not the Complete Works of Shakespeare) here. If you want to point it at a different text file, just modify the first line of the code.
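Incidentally, the whole query could itself be wrapped up as a function (using the approach described in the post on creating and using functions elsewhere on this blog) so that the file path and the number of words to return become parameters. Here's a sketch of how that might look – I haven't tested this variation, so treat it as an illustration:

(filePath as text, topN as number) =>
let
    // read the file and reassemble it into one big chunk of text
    Source = Lines.FromBinary(File.Contents(filePath)),
    RemoveLineBreaks = Lines.ToText(Source, " "),
    RemovePunctuation = Text.Remove(RemoveLineBreaks,
        {",", ".", "?", ";", ":", "'", "@", "#", "~", "{", "[", "}", "]", "(", ")", "*"}),
    Lowercase = Text.Lower(RemovePunctuation),
    // split into words and count the occurrences of each one
    TurnIntoTable = Table.FromValue(Lowercase),
    SplitIntoWords = Table.TransformColumns(TurnIntoTable, {"Value", Splitter.SplitTextByWhitespace()}),
    ListOfWords = SplitIntoWords{0}[Value],
    TableFromList = Table.FromList(ListOfWords, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    RenameColumnToWord = Table.RenameColumns(TableFromList, {{"Column1", "Word"}}),
    RemoveBlanks = Table.SelectRows(RenameColumnToWord, each [Word] <> ""),
    FindWordCounts = Table.Group(RemoveBlanks, {"Word"}, {{"Count", each Table.RowCount(_), type number}}),
    SortedRows = Table.Sort(FindWordCounts, {{"Count", Order.Descending}}),
    // keep only the topN most frequent words
    KeptFirstRows = Table.FirstN(SortedRows, topN)
in
    KeptFirstRows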

UPDATE: it turns out there is either a bug in PQ or I’m hitting one of its limits, and the counts here aren’t all correct. See the comments from Frank below for a version that works.

Written by Chris Webb

March 15, 2013 at 7:15 pm

The Dangers of Non_Empty_Behavior


One thing I see quite often when I’m reviewing MDX is the incorrect use of the Non_Empty_Behavior property. 99% of the time it’s set without the developer really knowing what it does, and luckily in most cases it makes no difference to performance or what the calculation returns. However… in this post I’m going to argue that you should not use it unless you really know what you’re doing and you can prove that it’s beneficial – because if it’s set incorrectly it can sometimes mean you get incorrect results from your queries.

First of all, what does Non_Empty_Behavior actually do? Basically, it's a hint to the SSAS Formula Engine that a calculation will return a null value whenever a given tuple also returns a null value. It is only a hint, though, and the Formula Engine is free to ignore it. For example, if you had an MDX calculated measure defined as

CREATE MEMBER CURRENTCUBE.[Measures].[MyMeasure] as [Measures].[Internet Sales Amount]*2;

It would be true to say that this calculated measure would always return a null value when the real measure [Measures].[Internet Sales Amount] returned null, so you would be able to say:

CREATE MEMBER CURRENTCUBE.[Measures].[MyMeasure] as [Measures].[Internet Sales Amount]*2
, NON_EMPTY_BEHAVIOR=[Measures].[Internet Sales Amount];

The late, great Mosha covered its use in detail in a number of blog posts from a few years ago:
http://sqlblog.com/blogs/mosha/archive/2006/11/05/budget-variance-a-study-of-mdx-optimizations-evaluation-modes-and-non-empty-behavior.aspx
http://sqlblog.com/blogs/mosha/archive/2007/03/06/averages-ratios-division-by-zero-and-non-empty-behavior.aspx

These posts refer to SSAS 2005; from SSAS 2008 onwards a lot of work went on to ensure that most of the scenarios where Non_Empty_Behavior had been necessary to get good performance were optimised away. Certainly, in the simple example I showed above there would be no need to set Non_Empty_Behavior at all – the Formula Engine will already be able to optimise it as much as it can. This means that if you're on SSAS 2005 then Non_Empty_Behavior can be very useful for tuning calculations, but if you are using SSAS 2008 or greater it's highly likely that you do not need to use Non_Empty_Behavior at all. In fact, Mosha's advice back in 2009, when he did a preconference seminar on SSAS performance tuning at the PASS Summit, was summed up in two slides, both of which carried essentially the same message: don't use it.


I think the message here is pretty clear…

So, it probably won’t help your query performance, but why is it dangerous? Consider the following query:

WITH
MEMBER MEASURES.SomeTest AS
IIF(Now()>0, "Internet", "Reseller")
MEMBER MEASURES.Demo as
STRTOMEMBER("[Measures].[" + measures.sometest + " Sales Amount]")
SELECT
{[Measures].Demo} 
ON 0,
NON EMPTY
[Product].[Category].[Category].members
*
[Date].[Date].[Date].members
ON 1
FROM [Adventure Works]

This returns 1887 rows. However, if I set Non_Empty_Behavior incorrectly, like so:

WITH
MEMBER MEASURES.SomeTest AS
IIF(Now()>0, "Internet", "Reseller")
MEMBER MEASURES.Demo as
STRTOMEMBER("[Measures].[" + measures.sometest + " Sales Amount]")
, NON_EMPTY_BEHAVIOR=[Measures].[Reseller Sales Amount]
SELECT
{[Measures].Demo}
ON 0,
NON EMPTY
[Product].[Category].[Category].members
*
[Date].[Date].[Date].members
ON 1
FROM [Adventure Works]

The query now returns only 64 rows! Because we have told SSAS that the calculation returns a null value whenever Reseller Sales Amount is null – which is not the case – rows that do in fact have values get filtered out by the NON EMPTY clause. Clearly this is a bad thing, and one that may be very hard to spot when you are testing.

The example above was quite obviously wrong, and in fact on SSAS 2012 it's quite hard to find a query where Non_Empty_Behavior returns incorrect results. More often I see calculations like this:

WITH
MEMBER MEASURES.PreviousSales AS
([Measures].[Internet Sales Amount], [Date].[Calendar].CURRENTMEMBER.PREVMEMBER)
, NON_EMPTY_BEHAVIOR=[Measures].[Internet Sales Amount]
SELECT {[Measures].[Internet Sales Amount], measures.PreviousSales} ON 0,
[Date].[Calendar].[Calendar Year].MEMBERS ON 1
FROM [Adventure Works]

…where Non_Empty_Behavior is set incorrectly (the calculation returns the Internet Sales Amount for the previous time period, and it’s not true to say that the previous time period’s sales will be null if the current time period’s sales are null) but the query still returns the correct results. This is because, luckily, Non_Empty_Behavior is being ignored.

So, to sum up: if you are using SSAS 2008 or greater, Non_Empty_Behavior probably won't solve your performance problem and may give you incorrect results. That is not to say it is never useful – there are still one or two isolated scenarios where it can help. Jorg Klein came across one fairly recently and blogged about it here:
http://sqlblog.com/blogs/jorg_klein/archive/2013/02/18/implement-ssas-md-cell-security-using-dimension-security-with-blazing-performance.aspx

But these scenarios are few and far between. Therefore don’t use it if you’re not sure what it does!

Written by Chris Webb

March 11, 2013 at 1:30 pm

SQLDay 2013 Conference in Poland


This is just a quick note to say that I’ll be speaking, and running a preconference seminar, at the SQLDay 2013 conference in Poland on the 22nd-24th May. I’m looking forward to it because I’ve never been to Poland before!

You can find all the details of my precon, “Real World Cube Design and Performance Tuning with Analysis Services”, here:
http://conference.plssug.org.pl/lang/en-us/agenda/sql-day-2013/sroda-22-05wednesday-22-05/w1-bi-track-real-world-cube-design-and-performance-tuning-with-analysis-services/

If you can’t make it to Poland, I’d love to see you at the PASS BA Conference (btw the discount code BAC521BL now gives you a $200 discount if you haven’t booked already), DatabaseDays Switzerland, SQLBits or one of my courses in London. Blimey, that’s a lot of teaching… and there’s more to be announced soon too.

Written by Chris Webb

March 7, 2013 at 4:47 pm


Data Explorer (Power Query) – Where Does The Real Work Get Done?


NOTE: This post was written before Data Explorer was renamed as Power Query. All of the content is still relevant to Power Query.

Yes, I know, I’ve been blogging a lot over the last week or so – I can’t help it, there’s been so much interesting stuff to blog about! Anyway, here’s yet another post about Data Explorer; not about functionality this time, but the result of some investigations about how it works and some ideas that resulted from that.

After having played around with Data Explorer so much, I started to wonder just how scalable it is. In the course of investigating this I found that while Data Explorer can do a lot of the data processing itself (and for some sources it must), it pushes as much of that logic as it can back to the data source. For example, if you connect to the Adventure Works DW database in SQL Server and then implement some simple steps in a Data Explorer query that:

  1. Import everything from the DimDate table, hiding everything except the DayNumberOfWeek column
  2. Filter it so that DayNumberOfWeek is greater than 5:
    = Table.SelectRows(DimDate, each [DayNumberOfWeek] > 5)
  3. Return the count of the number of rows that are returned:
    = Table.RowCount(FilteredRows)

…you can see if you run a Profiler trace that all of this logic is translated to SQL and not evaluated in Data Explorer itself. Here’s the final SQL statement generated by the third step above:

select count(1) as [$Item]
from
(
select [_].[DayNumberOfWeek]
from
(
select [DayNumberOfWeek]
from [dbo].[DimDate] as [$Table]
) as [_]
where [_].[DayNumberOfWeek] > 5
) as [$Table]
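For reference, the whole query on the Data Explorer side might look something like this written out by hand – a sketch, where the server name and the step names are my own rather than what the UI generates:

let
    // connect to SQL Server and navigate to the DimDate table
    Source = Sql.Database("localhost", "Adventure Works DW"),
    DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
    // steps 1-3 from the list above
    RemovedColumns = Table.SelectColumns(DimDate, {"DayNumberOfWeek"}),
    FilteredRows = Table.SelectRows(RemovedColumns, each [DayNumberOfWeek] > 5),
    RowCount = Table.RowCount(FilteredRows)
in
    RowCount

Even though each step reads like an operation on an in-memory table, all of them get folded into the single SQL statement shown above.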

I’m not sure which other data sources this is possible for, but it’s a good thing to see – you’re more likely to get good performance by doing the work back in the database than on your desktop. I wonder if it does this for Hive queries? If so, that would make it very useful for working with Hadoop/HDInsight.

And what about data sources where this isn't possible, like csv? I decided to test it out by loading a very large (1.4 GB) csv file I have with two columns containing integer values and 86 million rows, filtering it so that one of the column values was not null and less than 500000, and finding the resulting count of rows. Unfortunately I had to kill the query refresh after 15 minutes… In contrast, loading the same file into the Excel Data Model took almost exactly 6 minutes, and a DAX measure to do the same count/filter operation returned instantly. Not really a surprise I suppose, but it would be nice to have the option when working with csv files to load data into the Excel Data Model before you did anything else and then, when you did your transforms in Data Explorer, have those translated back to DAX queries and calculations. Alas, Data Explorer doesn't support the Excel Data Model, PowerPivot or SSAS as data sources at the moment – I hope this changes soon.
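For what it's worth, the csv test query was along these lines – a sketch, since the file path and column names here are invented (my real file just had two unnamed integer columns):

let
    // read the csv as a two-column table and type the columns as numbers
    Source = Csv.Document(File.Contents("C:\Users\Chris\Documents\BigFile.csv"), 2),
    Typed = Table.TransformColumnTypes(Source, {{"Column1", type number}, {"Column2", type number}}),
    // filter on the first column, then count the surviving rows
    FilteredRows = Table.SelectRows(Typed, each [Column1] <> null and [Column1] < 500000),
    RowCount = Table.RowCount(FilteredRows)
in
    RowCount

With no database to fold the logic back to, every one of those 86 million rows has to be processed on the desktop.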

This leads me on to the observation that in a lot of self-service BI scenarios the distinction between ETL and reporting can be blurred. If you just want to see a simple table showing your sales broken down by product in an Excel spreadsheet, you could build this quite easily with Data Explorer and not go near PowerPivot or the Excel Data Model. It would probably be easier to build too, since Data Explorer allows you to break complex operations into multiple steps, whereas putting all your logic in a single DAX expression can be mind-bendingly difficult (I’m reminded of this post I wrote two years ago). On the other hand you can do a lot of ETL-type things inside PowerPivot with calculated columns and measures; people have also been asking for the ability to use tables in the Excel Data Model as a data source for other tables (see here for example), for ETL purposes – it can be much faster and more elegant to use DAX to do some types of ETL work than SQL or SSIS. In the future I’d love to see tighter integration between Data Explorer and the Excel Data Model/PowerPivot so that we can get the best of both worlds.

Written by Chris Webb

March 6, 2013 at 4:53 pm

Calling A Web Service From Data Explorer (Power Query), Part 2


NOTE: This post was written before Data Explorer was renamed as Power Query. All of the content is still relevant to Power Query.

Don’t you hate it when you get ready to blog about something cool, and then someone comes along and beats you to it? That’s what’s just happened to me – Matt Masson just wrote an excellent blog post on how to do address lookup against a web service here:
http://www.mattmasson.com/2013/03/dynamic-lookups-with-data-explorer/
That’s what I was going to write about in part 2 of this series. I have to say he did a better job than I would have done though, and shows off several cool tricks I hadn’t seen before…

Ah well, so it goes – and anyway there are a few things I was going to say that are still worth saying. If you remember, in Part 1 I showed how to call the Google Distance Matrix API from Data Explorer, but I hard-coded the start and end points in the expression I used. How can you get the start and end points directly from Excel? Well it turns out that Data Explorer can read values direct from Excel tables.

To start with, you need to create some Excel tables to hold the values you want to use. I created two Excel tables called Origin and Destination to hold the two values I needed.


I can then take the value from the first row of the sole column of each table (the columns are also called Origin and Destination) by using

Excel.CurrentWorkbook(){[Name="Origin"]}[Content]{0}[Origin]

and

Excel.CurrentWorkbook(){[Name="Destination"]}[Content]{0}[Destination]

Here’s the full source expression with the hard-coded values replaced:

= Xml.Document(
    Web.Contents(
        "http://maps.googleapis.com/maps/api/distancematrix/xml",
        [Query = [
            origins = Excel.CurrentWorkbook(){[Name="Origin"]}[Content]{0}[Origin],
            destinations = Excel.CurrentWorkbook(){[Name="Destination"]}[Content]{0}[Destination],
            mode = "driving",
            sensor = "false"]]))

Now, you can enter any address in the Origin and Destination tables and click Refresh on the query.


The query will then be rerun with the new parameters. Note that it might also be a good idea to URL encode any text that you enter in these tables, using the EncodeURL() Excel function; I haven't done so here because I'm using UK postcodes as my origins and destinations, and they don't need URL encoding.
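If you would rather do the encoding inside the query than in the worksheet, the M library has a function for that too. Assuming Uri.EscapeDataString is available in the version of the library you're using, something like this should work in place of the plain table lookup:

= Uri.EscapeDataString(Excel.CurrentWorkbook(){[Name="Origin"]}[Content]{0}[Origin])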

It’s also possible to use the results of one query inside another query. Data Explorer allows you to use an Excel table as the source for a query.


And selecting the Destination Excel table as a source gives the following expression:

= Excel.CurrentWorkbook(){[Name="Destination"]}[Content]

The easy way to use any value from an existing query as the starting point for a new query is to right-click on a cell and select Add As New Query.


Doing this creates a new query with the value you clicked on as its source.


Here’s the expression for the only step in the newly-created query:

= Destination{0}[Destination]


Even if this new query isn’t any use on its own, the expression shown here can be used in other queries. Here’s the original web service call adapted to use two other queries as its source:

= Xml.Document(
    Web.Contents(
        "http://maps.googleapis.com/maps/api/distancematrix/xml",
        [Query = [
            origins = Origin{0}[Origin],
            destinations = Destination{0}[Destination],
            mode = "driving",
            sensor = "false"]]))

However, as Matt shows in his post, this only works if Fast Combine is enabled, otherwise you’ll see an error message.

You can download the sample workbook for this post here.

Written by Chris Webb

March 4, 2013 at 8:53 pm
