Chris Webb's BI Blog

Analysis Services, MDX, PowerPivot, DAX and anything BI-related

This Is My 1000th Blog Post

with 8 comments

Just a few months away from the tenth anniversary of my first post here, I’ve reached the milestone that is my 1000th blog post. If you’ve been with me since back then, thanks for reading! I have no idea how I managed to write so much – it’s an average of around two posts per week, which I certainly haven’t managed recently – but I suspect that the answer lies in the fact that I posted a lot of rubbish here in the early years that I’m embarrassed by now.

I can remember the day when I decided to start this blog quite well. It was just after Christmas so the office was quiet and I didn’t have much work to do; blogging was the cool new thing back in late 2004 and having discovered that Mosha had started a blog I thought it was something I should be doing too, so as not to be left behind. Microsoft had just launched its own blogging platform so I signed myself up. I didn’t think I would stick at it this long…

At first I thought I would just use it for writing up solutions to common Analysis Services and MDX problems, so that I didn’t have to keep repeating myself when I was answering questions on the microsoft.public.sqlserver.olap newsgroup. I kept going, though, for a lot of other reasons:

  • To remember what I’ve learned. If I didn’t write this stuff down I would forget it, and trust me, I’m always googling for old posts here. This also explains why there is very little overall structure or purpose to what I write about. Technical books need to cover a topic very methodically: start at the basics, explain all the concepts and functionality, not miss anything out, and so on. Here, if I learn something interesting and useful while at work, or helping someone on a forum, or while playing around with a new tool, I just need to write that one thing down and not worry about whether it fits into some greater plan.
  • I also find that the act of writing up a problem or topic for a post helps me understand it better. To be able to explain a technical concept you first have to be sure you understand it properly yourself, and writing for other people forces you to do that.
  • To pass on Microsoft BI-related news. I work with these tools every day and so it’s natural that I want to find out what new toys I’ll have to play with in the future. I find this stuff interesting and fun, and it seems like there are several thousand other people around the world who also want to know what’s going on (even if we might not want to admit this publicly). I like airing my opinions too: sometimes Microsoft does things I agree with, sometimes it does things I think are crazy, and since my career and business is wholly dependent on Microsoft BI I think the occasional bit of public feedback is healthy and allowable. Brent Ozar sums up my feelings on this subject perfectly here. I’ve got in trouble once or twice for things I’ve written, but I’ve never regretted writing any of my posts.
  • It’s marketing for my consultancy and training. I have to make a living somehow, and if I didn’t blog then it would be much harder to find customers – I think my blog is much more valuable in this respect than writing books or speaking at conferences or user groups. I don’t want to sound cynical, though, and I don’t see this blog as something that is purely commercial. I love to share and it just so happens that sharing my knowledge is also good for business. Some two years after starting this blog, just after I resigned from my permie job to become a self-employed consultant, one of my soon-to-be ex-colleagues said to me “You know, you’ll have to stop blogging now: why would anyone hire you if they can read everything you know on your blog for free?”. I didn’t have a good answer for him at the time but I soon found that if someone finds the answer to a problem on my blog, they are much more likely to think about hiring me when they have a problem they can’t solve. What’s more, I firmly believe that the way that people in the SQL Server community share knowledge publicly, even when they are aware that this knowledge could be used by their competitors, means that the community as a whole is stronger, SQL Server is more successful, and we all benefit more commercially than if we had not shared in the first place.
  • I enjoy writing so I’m quite happy to spend my spare time writing blog posts. There’s no way I could have forced myself to write a thousand posts if I didn’t enjoy doing it. I also travel a lot for work, so that results in a lot of time spent in airports and hotel rooms with nothing better to do. To make another comparison with writing tech books: a tech book has to be objective, impartial, polished, structured, sober and impersonal, whereas a blog is (or at least in my opinion should be) personal, subjective, haphazard, rough-edged and sometimes controversial. This makes blogging less of an effort and more of a pleasure.
  • Finally, I admit it, I get a kick out of knowing that when I write something there are people out there who want to read it.

Will I make it to my 2000th post? I have no idea, but I probably will if Microsoft are still making BI tools and I’m still using them.

Written by Chris Webb

August 31, 2014 at 11:45 pm

Posted in Random Thoughts

Power Query Functions That Return Functions

with 2 comments

You’re probably aware that, in Power Query, a query can return a function. So for example here’s a very simple query (so simple that no let statement is needed) called MultiplyTwoNumbers with the following definition:

(x as number, y as number) => x * y
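If you want to test the function on its own, you can call it from another blank query; a trivial sketch:

let
    //invoke the function with x=2 and y=3; returns 6
    TestResult = MultiplyTwoNumbers(2, 3)
in
    TestResult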

It can be used on the following table in Excel:

…to multiply the numbers in the column called Number by two and show the result in a custom column like so:

let
    Source = Excel.CurrentWorkbook(){[Name="Data"]}[Content],
    #"Inserted Custom" = Table.AddColumn(Source, "Custom", each MultiplyTwoNumbers(2, [Number]))
in
    #"Inserted Custom"

Here’s the output:

It’s also the case that a function can return another function. Consider the following query, called MultiplyV2:

let
    EnterX = (x as number) =>
        let
            EnterY = (y as number) => x * y
        in
            EnterY
in
    EnterX

It is a function that takes a single parameter, x, and it returns a function that takes a single parameter, y. The function that is returned multiplies the value of x by the value of y. Here’s an example of how it can be used on the table shown above:

let
    //Return a function that multiplies by 2
    MultiplyBy2 = MultiplyV2(2),
    //Load data from the table
    Source = Excel.CurrentWorkbook(){[Name="Data"]}[Content],
    //Use the MultiplyBy2 function in a custom column
    #"Inserted Custom" = Table.AddColumn(Source, "Custom", each MultiplyBy2([Number]))
in
    #"Inserted Custom"

This gives exactly the same result as before:

In this query, the MultiplyBy2 step calls the MultiplyV2 function with the argument 2, and this returns a function that multiplies the values passed to it by 2. This function can then be called in the final step, where the custom column is added to the table using the expression MultiplyBy2([Number]).
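Incidentally, because MultiplyV2 returns a function, you don’t have to give the returned function a name before you use it – you can invoke it immediately, curried-style. A quick sketch:

let
    //call MultiplyV2 to get a function, then invoke that function
    //straight away; returns 6
    Result = MultiplyV2(2)(3)
in
    Result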

Interesting, isn’t it? I hope this satisfies your curiosity, Marco!

You can download the sample workbook for this post here.

Written by Chris Webb

August 24, 2014 at 10:09 pm

Posted in Uncategorized

Removing Punctuation From Text In Power Query

with 6 comments

In one of my first posts on Power Query (and still my favourite) I found the top 100 words in the complete works of Shakespeare. As always when you’re learning a new tool, though, I look back at what I wrote then and realise there are better ways of doing things… one of which is removing all of the punctuation from a piece of text.

In the post I’m talking about I used the following expression to remove punctuation:

Text.Remove(RemoveLineBreaks, {",", ".", "?", ";", ":", ";", "’", "@", "#", "~", "{", "[", "}", "]", "(", ")", "*"})

However, as you can see, it only removes the punctuation I could be bothered to hard-code in the list – luckily Shakespeare didn’t use too much exotic punctuation! There is a better way of doing this, though, and it relies on the Character.FromNumber() function, which takes a number and returns the equivalent Unicode character; for example, the expression Character.FromNumber(38) returns an ampersand (&) symbol.

Looking at the list of Unicode characters here it’s easy to identify the ranges of numbers that represent punctuation characters. These ranges can then be represented in M as lists. Remembering that lists of ranges of numbers can be declared easily by using the notation {1..5}, which results in the list of numbers {1,2,3,4,5}, you can write a list of lists containing the numbers representing the Latin punctuation characters in Unicode like so:

{{0..31},{33..47},{58..64},{91..96},{123..191}}

and you can use List.Combine() to turn this list of lists into a single list of numbers.  Finally, you can use List.Transform() and Character.FromNumber() to get a list of the Unicode characters for these numbers, and pass that list to Text.Remove(). Here’s a query showing everything:

let
    //get a list of lists containing the numbers of Unicode punctuation characters
    numberlists = {{0..31},{33..47},{58..64},{91..96},{123..191}},
    //turn this into a single list
    combinedlist = List.Combine(numberlists),
    //get a list of all the punctuation characters that these numbers represent
    punctuationlist = List.Transform(combinedlist, each Character.FromNumber(_)),
    //some text to test this on
    inputtext = "Hello! My name is Chris, and I'm hoping that this *cool* post will help you!",
    //the text with punctuation removed
    outputtext = Text.Remove(inputtext, punctuationlist)
in
    outputtext

The output of this query is this:

Hello My name is Chris and Im hoping that this cool post will help you
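Incidentally, if you need to do this in several queries, you could wrap the same logic up as a query that returns a function. Here’s a minimal sketch – the name RemovePunctuation is my own invention, not anything built in:

//RemovePunctuation: a reusable version of the logic above
(inputtext as text) as text =>
let
    //the same Unicode ranges as before, combined and converted to characters
    punctuationlist = List.Transform(
        List.Combine({{0..31},{33..47},{58..64},{91..96},{123..191}}),
        each Character.FromNumber(_))
in
    Text.Remove(inputtext, punctuationlist)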

You can download the sample workbook for this post here.

Written by Chris Webb

August 18, 2014 at 11:20 am

Posted in Power Query

Comparers, Combiners, Replacers and Splitters in Power Query

with 8 comments

At the end of the Power Query Formula Library Specification (which can be downloaded here) are sections on Comparer, Combiner, Replacer and Splitter functions. These functions are most often used in conjunction with other functions like Table.CombineColumns() and Table.SplitColumn, but what you may not realise from the documentation (which also has a few minor but nonetheless confusing bugs in it) is what these functions do: they are functions that return functions, and the functions that they return can be used independently just like any other function.
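To illustrate the usual combined pattern first, here’s a minimal sketch (the table and column names are made up for the example) where the function returned by Splitter.SplitTextByDelimiter() is passed straight to Table.SplitColumn():

let
    //a made-up table with one comma-delimited column
    Source = #table({"Full Name"}, {{"Chris,Webb"}, {"Marco,Russo"}}),
    //pass the function returned by Splitter.SplitTextByDelimiter()
    //to Table.SplitColumn() to split the column in two
    SplitNames = Table.SplitColumn(Source, "Full Name",
        Splitter.SplitTextByDelimiter(","), {"First Name", "Last Name"})
in
    SplitNames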

Take Splitter.SplitTextByDelimiter() as an example. It returns a function that splits a piece of text by a delimiter, and returns a list containing the resulting pieces. The following M code calls this function to return a function that splits comma delimited text:

let
    demo = Splitter.SplitTextByDelimiter(",")
in
    demo

As noted here, once you have a query that returns a function you can see that function’s signature and invoke it from the Query Editor window. Here’s what the query above shows in the Query Editor window:

[Screenshot: the function’s signature, with an Invoke button, in the Query Editor]

If you click the Invoke button and enter the text

one,two,three,four

into the dialog that appears, the code for the query becomes:

let
    demo = Splitter.SplitTextByDelimiter(","),
    Invokeddemo = demo("one,two,three,four")
in
    Invokeddemo

What is returned is the list {"one", "two", "three", "four"}, which looks like this in the Query Editor window:

[Screenshot: the four values of the list shown in the Query Editor]

There are various other Splitter functions that can be used to return functions that split text in different ways. Similarly, the Combiner functions return functions that can be used to combine a list of text into a single piece of text. For example:

let
    demo = Combiner.CombineTextByDelimiter("--"),
    Invokeddemo = demo({"one","two","three","four"})
in
    Invokeddemo

This returns the text

one--two--three--four

The Replacer functions return functions for replacing values in text, while the Comparer functions return functions that can be used for comparing text using specific cultures and case sensitivities.
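For example, Comparer.FromCulture() returns a comparison function that you can invoke directly. A minimal sketch, with made-up values:

let
    //returns a function that compares two text values using the en-GB
    //culture, ignoring case (the second argument)
    comp = Comparer.FromCulture("en-GB", true),
    //invoking the returned function: returns 0 because the two values
    //are equal when case is ignored
    Result = comp("power query", "POWER QUERY")
in
    Result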

Written by Chris Webb

August 11, 2014 at 9:30 am

Posted in Power Query

The Ethics Of Big Data

with 4 comments

Some time ago I received a review copy of a book called “Ethics Of Big Data” from O’Reilly; I didn’t get round to writing a review of it here for a number of reasons but, despite its flaws (for example its brevity and limited scope), it’s worth reading. It deals with the ethics of data collection and data analysis from a purely corporate point of view: if organisations do not think carefully about what they are doing then

“Damage to your brand and customer relationships, privacy violations, running afoul of emerging legislation, and the possibility of unintentionally damaging reputations are all potential risks”

All of which is true, although I think what irked me about the book when I read it was that it did not tackle the wider and (to my mind) more important question of the social impact of new data technologies and their application. After all, this is what you and I do for a living – and I know that I haven’t spent nearly enough time thinking these issues through.

What prompted me to think about this again was a post by Adam Curtis which argues that the way that governments and corporations are using data is stifling us on a number of levels from the personal to the political:

“What Amazon and many other companies began to do in the late 1990s was build up a giant world of the past on their computer servers. A historical universe that is constantly mined to find new ways of giving back to you today what you liked yesterday – with variations.

Interestingly, one of the first people to criticise these kind of “recommender systems” for their unintended effect on society was Patti Maes who had invented RINGO. She said that the inevitable effect is to narrow and simplify your experience – leading people to get stuck in a static, ever-narrowing version of themselves.

Stuck in the endless you-loop.”

Once our tastes and opinions have been reduced to those of the cluster the k-means algorithm has placed us in, we have become homogenised and easier to sell to, slaves to our past behaviour. Worse, the things we have in common with the people in other clusters become harder to see. Maybe all of this is inevitable, but if there is going to be an informed debate on this then shouldn’t we, as the people who actually implement these systems, take part in it?

Written by Chris Webb

August 4, 2014 at 9:30 am

Posted in Random Thoughts

Calculating The Value Of Overdue Invoices Using Many-To-Many Relationships in SSAS Multidimensional

with 11 comments

Recently I had to solve the same, rather complex, problem for three different customers within the space of two weeks, a variation of the ‘events-in-progress’ problem I’ve blogged about a lot in the past. It’s this one: how can you calculate the value of your overdue invoices on any given date? It’s important to stress that we are not talking about the invoices that are overdue today – we want to be able to pick any date in the past and find out the value of invoices that were overdue at that point in time.

Let’s imagine you have a fact table containing payments against invoices: each row has an invoice number, the date of the payment, the payment amount, and the date that the invoice is due. A positive value in the Amount column indicates that this is the opening amount of the invoice; negative values in the Amount column are payments against the invoice. Multiple payments can be made against an invoice before it is fully closed. Here’s some example data:

[Screenshot: example rows from the fact table, with the rows for invoice 5 highlighted in two boxes]

Looking at the rows highlighted for invoice 5, you can see in the upper box that there is a positive amount of £35 shown on January 1st 2014 – this is the date that the invoice was opened, and £35 is the full value of the invoice. The invoice is due on January 10th 2014. In the lower box you can see there were four separate payments of £5, £5, £10 and £15 before the invoice was fully paid off on January 12th 2014.

Now, if you were to do a total-to-date in MDX (similar to what I describe here) it would be pretty easy to calculate the outstanding amount on all invoices on any given date. However, the requirement here is not only to do that, but to break the value down so that you can see the total value of overdue invoices and of non-overdue invoices (which I’m going to call backlog in this post) on any given date. This is tricky to do because we have to take the Due Date of each invoice into account as well as the Payment Date. The desired output for invoice 5 is this:

[Screenshot: a PivotTable showing the backlog and overdue amounts by date for invoice 5]

As you can see in this PivotTable, for invoice 5 £25 was still outstanding on January 9th 2014, but this is shown as backlog because this is before the due date of January 10th. On January 11th, one day after the due date, the remaining outstanding amount of £15 is shown as overdue. Of course, we also want to be able to calculate the correct values for all invoices:

[Screenshot: a PivotTable showing the backlog and overdue amounts by date for all invoices]

One way of solving this problem would be to calculate the overdue and backlog values for each date that each invoice is open in your ETL, and store these values in a snapshot fact table. This works, and will give you the best possible query performance, but it has two major drawbacks: first, it makes your ETL much slower, and second it makes your fact table much larger. This post describes how you can calculate the overdue and non-overdue values on any given date using many-to-many relationships instead, without having to blow out the fact table.

To go along with the fact table (called FactInvoice) shown above, I have a date dimension table called DimDate (I have deliberately reduced the number of rows here to the dates I have data for, for reasons that will become clear soon):

[Screenshot: the DimDate dimension table]

I have a dimension table containing all of the numbers of days that an invoice can be overdue or not overdue for (it has one less than double the number of rows of the date dimension table), plus a second column classifying each row as ‘Backlog’ or ‘Overdue’:

[Screenshot: the Overdue Days dimension table]

… and an invoice dimension table, called DimInvoice, that just contains the distinct invoice numbers.

Now, let me explain how to build the SSAS cube.

Step 1

  • Build dimensions from all of the tables shown above, so you have dimensions called Date, Overdue Days and Invoice.
  • Build a cube with one measure group, based on FactInvoice, and create one measure with AggregateFunction Sum based on the Amount column in that table.
  • Add the Invoice dimension to the cube with a regular relationship. Add the Date dimension to the cube twice, as Payment Date and Due Date, with regular relationships on the PaymentDate and DueDate columns. The Dimension Usage tab should look like this:

[Screenshot: the Dimension Usage tab after Step 1]

Step 2

  • Create the following view in SQL Server:

    CREATE VIEW [dbo].[FactDateToPaymentDate]
    AS
    SELECT a.DateKey, b.DateKey AS PaymentDateKey
    FROM dbo.DimDate AS a
    INNER JOIN dbo.DimDate AS b ON a.DateKey >= b.DateKey

    This view returns all of the combinations of a given date and all dates up to and including that date.

  • Add this view to the DSV and create a new measure group from it; you’ll need to create a measure here, but it can just be a Count measure.
  • Add the Date dimension to the cube once again, this time leaving the name as Date (you will now have three role-playing copies of the Date dimension in the cube).
  • Set up regular relationships between the Date and Payment Date dimensions and the new measure group, then a many-to-many relationship between Date and the Invoice measure group. This means that when you query the cube by the Date dimension, the many-to-many relationship will mean you see the sum of all Amounts whose payment date is up to and including the date selected. The Dimension Usage tab will look like this:

[Screenshot: the Dimension Usage tab after Step 2]

Step 3

  • Create another view in SQL Server with this definition:

    CREATE VIEW [dbo].[FactDueDateToAsOfDate]
    AS
    SELECT a.DateKey AS DueDateKey, b.DateKey AS AsOfDateKey,
        DATEDIFF(dd, a.FullDate, b.FullDate) AS OverDueDays
    FROM dbo.DimDate AS a
    CROSS JOIN dbo.DimDate AS b

    Yes, I am cross joining the DimDate table with itself and yes, this could return a lot of rows. However you should find that the view is very fast to execute.

  • Add this view to the DSV and build another measure group from it, again with a single Count measure on it.
  • Set up a regular relationship between this new measure group and the Due Date dimension.
  • Add the Overdue Days dimension to the cube, give it a regular relationship with the new measure group and a many-to-many relationship with the Invoice measure group through the new measure group.
  • Add the Date dimension to the cube yet again, creating a new role-playing dimension called As Of Date, give it a regular relationship with the new measure group and a many-to-many relationship with the Invoice measure group through the new measure group. The Dimension Usage tab will now look like this:

[Screenshot: the Dimension Usage tab after Step 3]

The purpose of this measure group is this: if you select a date on the As Of Date dimension, you will be able to select ‘Overdue’ on the Overdue Days dimension and this will give you all of the dates on Due Date that were overdue on that date.

Step 4

  • You only want to have to select one date in your PivotTable, so create the following MDX calculated measures, which take your selection on the Date dimension and apply it to the As Of Date dimension too:

CREATE MEMBER CURRENTCUBE.MEASURES.[Open Amount Hidden] AS
([Measures].[Amount],
LINKMEMBER([Date].[Date].CURRENTMEMBER, [As Of Date].[Date])
), VISIBLE=FALSE;

CREATE MEMBER CURRENTCUBE.MEASURES.[Open Amount] AS
IIF(MEASURES.[Open Amount Hidden]=0, NULL, MEASURES.[Open Amount Hidden]);

  • Hide all measures apart from the calculated measures you’ve just created.
  • Hide the Payment Date, Due Date and As Of Date dimensions.

Conclusion

This is a very complex pattern, I know, and this is after I’ve simplified it a lot (if you need currency conversion as well then things get even worse) but I also know it’s extremely useful from a business point of view. Query performance is also reasonably good, at least in the places where I have implemented this.

You can download my sample SQL Server database and VS 2012 project here.

Written by Chris Webb

July 28, 2014 at 10:00 am

Working With Excel Named Ranges In Power Query

with 24 comments

One of the more recent additions to Power Query is the ability to access data from named ranges in the Excel worksheet rather than an Excel table. I’ve got used to formatting data as tables in Excel because that’s what Power Pivot needs to import data directly from the worksheet, but if you are working with Power Query and a pre-existing workbook then trying to reformat data as tables can be a pain. Also, if you just want to import a single value, for example as a parameter to a query, a table seems like overkill.

(Incidentally, if you’re wondering what a named range is in Excel, there are tons of good introductions to the subject on the internet like this one. You can do loads of cool stuff with them.)

Consider the following Excel worksheet:

[Screenshot: the worksheet containing the three named ranges]

There are three named ranges here: FirstRange, SecondRange and ThirdRange, and the values in the cells show which range the cells are in. FirstRange consists of two cells in two columns; SecondRange consists of three cells in a single row; and ThirdRange consists of three non-contiguous cells. (You can also use this trick to display the names of all contiguous ranges in an Excel workbook, but alas it does not seem to work for non-contiguous ranges).

At the moment, the Power Query ribbon doesn’t make it obvious that you can use named ranges as data sources. However, you can see all the tables and named ranges in a workbook, and even return that list from a query, by creating a blank query and using the expression

= Excel.CurrentWorkbook()

[Screenshot: the output of Excel.CurrentWorkbook() in the Query Editor]

Once you’ve done this you can see all the ranges (and also any tables) in the workbook, and click on the table link next to the name to see the data. For example, clicking on FirstRange shows the following table in a new step in the query editor:

[Screenshot: the contents of FirstRange]

The expression to get at this table in a single step is:

= Excel.CurrentWorkbook(){[Name="FirstRange"]}[Content]

The range SecondRange in my example is equally straightforward to reference, and you can see its contents by using the expression

= Excel.CurrentWorkbook(){[Name="SecondRange"]}[Content]

[Screenshot: the contents of SecondRange]

Unfortunately ThirdRange, which is not contiguous, is a problem: I can only get the first cell in the range. So the expression

= Excel.CurrentWorkbook(){[Name="ThirdRange"]}[Content]

returns just this table:

[Screenshot: a table containing only the first cell of ThirdRange]

It would be nice if we could get a list containing the cell values, rather than a table, for ranges like this…

Last thing to mention is that if you do want the value in a cell, rather than a table, you just need to right-click inside the cell in the Query Editor and select Drill Down:

[Screenshot: the Drill Down option on the right-click menu in the Query Editor]

This returns the value (in this case the text “Third Range Cell 1”) in the cell you clicked on:

[Screenshot: the single value returned by Drill Down]

This is a much more useful value to return than a table containing a single row/column, if you intend to use a value from a single cell in a named range as a parameter to another query.
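In M, the resulting expression is just row-and-field access on the range’s table. Here’s a sketch, assuming a hypothetical single-cell named range called ParameterCell whose value you want to use as a parameter (ranges are typically imported with a single column called Column1):

let
    //get the named range's table, take its first row ({0}),
    //then take the value in the Column1 field
    param = Excel.CurrentWorkbook(){[Name="ParameterCell"]}[Content]{0}[Column1]
in
    param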

You can download the sample workbook for this post here.

Written by Chris Webb

July 22, 2014 at 9:30 am

Posted in Excel, Power Query
