Chris Webb's BI Blog

Analysis Services, MDX, PowerPivot, DAX and anything BI-related

Handling Added Or Missing Columns In Power Query

with 3 comments

A recent conversation in the comments of this blog post brought up the subject of how to handle columns that have either been removed from or added to a data source in Power Query. Anyone who has worked with csv files knows that they have a nasty habit of changing format even when they aren’t supposed to, and added or removed columns can cause all kinds of problems downstream.

Ken Puls (whose excellent blog you are probably already reading if you’re interested in Power Query) pointed out that it’s very easy to protect yourself against new columns in your data source. When creating a query, select all the columns that you want, then right-click and select Remove Other Columns:

image

This means that if any new columns are added to your data source in the future, they won’t appear in the output of your query. In the M code, this is done using the Table.SelectColumns() function.
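For example, here’s a minimal sketch of the kind of M this produces, using the same SampleData.csv file and the Month, Product and Sales columns that appear later in this post (your step names and file path will differ):

let
    //Connect to CSV file
    Source = Csv.Document(
                      File.Contents(
                       "C:\Users\Chris\Documents\Power Query demos\SampleData.csv"
                      ),null,",",null,1252),
    //Use first row as headers
    FirstRowAsHeader = Table.PromoteHeaders(Source),
    //Keep only the columns you expect; any columns added to the
    //file later will simply be ignored
    RemovedOtherColumns = Table.SelectColumns(
                              FirstRowAsHeader,
                              {"Month", "Product", "Sales"})
in
    RemovedOtherColumns

Incidentally, Table.SelectColumns() also has an optional third parameter (for example MissingField.UseNull or MissingField.Ignore) that controls what happens if one of the named columns isn’t found, which is worth knowing about when columns go missing.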

Dealing with missing columns is a little bit more complicated. In order to find out whether a column is missing, first of all you’ll need a list of the columns that should be present in your query. You can of course store these column names in a table in Excel and enter them manually, or you can generate the list in M fairly easily by creating a query that connects to your data source and uses the Table.ColumnNames() function, something like this:

let
    //Connect to CSV file
    Source = Csv.Document(
                      File.Contents(
                       "C:\Users\Chris\Documents\Power Query demos\SampleData.csv"
                      ),null,",",null,1252),
    //Use first row as headers
    FirstRowAsHeader = Table.PromoteHeaders(Source),
    //Get a list of column names
    GetColumns = Table.ColumnNames(FirstRowAsHeader),
    //Turn this list into a table
    MakeATable = Table.FromList(
                              GetColumns, 
                              Splitter.SplitByNothing(), 
                              null, 
                              null, 
                              ExtraValues.Error),
    //Rename this table's sole column
    RenamedColumns = Table.RenameColumns(
                                         MakeATable,
                                         {{"Column1", "ColumnName"}})
in
    RenamedColumns 

Given a csv file that looks like this:

image

…the query above returns the following table of column names:

image

You can then store the output of this query in an Excel table for future reference – just remember not to refresh the query!

Having done that, you can then look at the columns returned by your data source and compare them with the columns you are expecting by using the techniques shown in this post. For example, here’s a query that reads a list of column names from an Excel table and compares them with the columns returned from a csv file:

let
    //Connect to Excel table containing expected column names
    ExcelSource = Excel.CurrentWorkbook(){[Name="GetColumnNames"]}[Content],
    //Get list of expected columns
    ExpectedColumns = Table.Column(ExcelSource, "ColumnName"),
    //Connect to CSV file
    CSVSource = Csv.Document(
                      File.Contents(
                       "C:\Users\Chris\Documents\Power Query demos\SampleData.csv"
                      ),null,",",null,1252),
    //Use first row as headers
    FirstRowAsHeader = Table.PromoteHeaders(CSVSource),
    //Get a list of column names in csv
    CSVColumns = Table.ColumnNames(FirstRowAsHeader),
    //Find missing columns
    MissingColumns = List.Difference(ExpectedColumns, CSVColumns),
    //Find added columns
    AddedColumns = List.Difference(CSVColumns, ExpectedColumns),
    //Report what has changed
    OutputMissing = if List.Count(MissingColumns)=0 then
                     "No columns missing" else
                     "Missing columns: " & Text.Combine(MissingColumns, ","),
    OutputAdded = if List.Count(AddedColumns)=0 then
                     "No columns added" else
                     "Added columns: " & Text.Combine(AddedColumns, ","),
    Output = OutputMissing & "   " & OutputAdded
in
    Output

Given a csv file that looks like this:

image

…and an Excel table like the one above containing the three column names Month, Product and Sales, the output of this query is:

image

It would be very easy to convert this query to a function that you could use to check the columns expected by multiple queries, and also to adapt the output to your own needs; a rough sketch of such a function is shown below. In certain scenarios (such as when you’re importing data from SQL Server) you might also want to check the data types used by the columns; I’ll leave that for another blog post though. In any case, data types aren’t so much of an issue with CSV files because it’s Power Query that imposes the types on the columns within a query, and any type conversion issues can be dealt with by Power Query’s error handling functionality (see Gerhard Brueckl’s post on this topic, for example).
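Here’s a rough, untested sketch of what such a function could look like – the name CheckColumns and its parameter names are just examples, not anything built in – taking a table and a list of expected column names and returning the same kind of report text as the query above:

//Returns a function that compares a table's columns with a list
//of expected column names
let
    CheckColumns = (InputTable as table, ExpectedColumns as list) as text =>
    let
        //Get the column names actually present in the input table
        ActualColumns = Table.ColumnNames(InputTable),
        //Find expected columns that are missing
        MissingColumns = List.Difference(ExpectedColumns, ActualColumns),
        //Find columns that have been added
        AddedColumns = List.Difference(ActualColumns, ExpectedColumns),
        //Report what has changed
        OutputMissing = if List.Count(MissingColumns)=0 then
                         "No columns missing" else
                         "Missing columns: " & Text.Combine(MissingColumns, ","),
        OutputAdded = if List.Count(AddedColumns)=0 then
                         "No columns added" else
                         "Added columns: " & Text.Combine(AddedColumns, ","),
        Output = OutputMissing & "   " & OutputAdded
    in
        Output
in
    CheckColumns

You could then invoke it from any query that needs checking, for example with something like CheckColumns(FirstRowAsHeader, ExpectedColumns).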

You can download a workbook containing the two queries from this post here.

Written by Chris Webb

February 26, 2015 at 12:37 pm

Posted in Power Query

Optimising SSAS Many-To-Many Relationships By Adding Redundant Dimensions

with one comment

The most elegant way of modelling your SSAS cube doesn’t always give you the best query performance. Here’s a trick I used recently to improve the performance of a many-to-many relationship going through a large fact dimension and large intermediate measure group…

Consider the following cube, built from the Adventure Works DW database and showing a many-to-many relationship:

image

The Fact Internet Sales measure group contains sales data; the Product, Date and Customer dimensions are what you would expect; Sales Order is a fact dimension with one member for each sales transaction and therefore one member for each row in the fact table that Fact Internet Sales is built from. Each Sales Order can be associated with zero to many Sales Reasons, and the Sales Reason dimension has a many-to-many relationship with the Fact Internet Sales measure group through the Fact Internet Sales Reason measure group. Only the Sales Order dimension connects directly to both the Fact Internet Sales Reason and Fact Internet Sales measure groups.

There’s nothing obviously wrong with the way this is modelled – it works and returns the correct figures – and the following query shows how the presence of the many-to-many relationship means you can see the Sales Amount measure (from the Fact Internet Sales measure group) broken down by Sales Reason:

select
{[Measures].[Sales Amount]} on 0,
non empty
[Sales Reason].[Sales Reason].[Sales Reason].members
on 1
from m2m1
where([Date].[Calendar Year].&[2003], 
[Product].[Product Category].&[3],
[Customer].[Country].&[United Kingdom])


image

However, to understand how we can improve the performance of a many-to-many relationship you have to understand how SSAS resolves the query internally. At a very basic level, in this query, SSAS starts with all of the Sales Reasons and then, for each one, finds the list of Sales Orders associated with it by querying the Fact Internet Sales Reason measure group. Once it has the list of Sales Orders for each Sales Reason, it queries the Fact Internet Sales measure group (which is also filtered by the Year 2003, the Product Category Clothing and the Customer Country UK) and sums up the value of Sales Amount for those Sales Orders, getting a single value for each Sales Reason. A Profiler trace shows this very clearly:

image

The Resource Usage event gives the following statistics for this query:

READS, 7

READ_KB, 411

WRITES, 0

WRITE_KB, 0

CPU_TIME_MS, 15

ROWS_SCANNED, 87299

ROWS_RETURNED, 129466

Given that the Sales Order dimension is a large one (in this case around 60000 members – and large fact dimensions are quite common with many-to-many relationships) it’s likely that one Sales Reason will be associated with thousands of Sales Orders, and therefore SSAS will have to do a lot of work to resolve the relationship.

In this case, the optimisation comes with the realisation that we can add the other dimensions present in the cube to the Fact Internet Sales Reason measure group to try to reduce the number of Sales Orders that each Sales Reason is resolved to. Since Sales Order is a fact dimension with one member for each sales transaction, and since each sales transaction also has a Date, a Product and a Customer associated with it, we can add the keys for these dimensions to the fact table on which Fact Internet Sales Reason is built and join these dimensions to it directly:

image

This is not an assumption you can make for all many-to-many relationships, for sure, but it’s certainly true for a significant proportion.

The Product, Date and Customer dimensions don’t need to be present for the many-to-many relationship to work, but adding a Regular relationship between them and Fact Internet Sales Reason helps SSAS speed up the resolution of the many-to-many relationship when they are used in a query. This is because, in the original design, the selection of a single member on Sales Reason in the test query becomes a selection on all of the Sales Orders that have ever been associated with that Sales Reason; with the new design, the selection of a single member on Sales Reason becomes a selection on a combination of Dates, Customers, Products and Sales Orders – and since the query itself is also applying a slice on Date, Customer and Product, this is a much smaller selection than before. For the query shown above, with the new design, the Resource Usage event now shows:

READS, 11

READ_KB, 394

WRITES, 0

WRITE_KB, 0

CPU_TIME_MS, 0

ROWS_SCANNED, 47872

ROWS_RETURNED, 1418

The much lower numbers for ROWS_SCANNED and ROWS_RETURNED show that the Storage Engine is doing a lot less work. For the amount of data in Adventure Works the difference in query performance is negligible, but in the real world I’ve seen this optimisation make a massive difference to performance, resulting in queries running up to 15 times faster.

Don’t forget that there are many other ways of optimising many-to-many relationships, such as those described in this white paper. Also, if you have a large fact dimension that does not need to be visible to the end user and is only needed to make the many-to-many relationship work, you can reduce the overhead of processing it by breaking it up into multiple smaller dimensions as described here.

Written by Chris Webb

February 21, 2015 at 6:38 pm

I’m speaking at the PASS BA Conference

with one comment

I haven’t been shy about stating my support for the PASS BA conference and the associated efforts by PASS to reach out beyond its traditional audience to analysts and other power users (see here for example). I won’t bore you with my opinions again, except to say that at the third attempt I think PASS have got the balance of session topics right at the upcoming PASS BA conference in Santa Clara this April. There’s a stellar team of Excel speakers, including Mr Excel and Chandoo. There’s David Smith from Revolution Analytics, the company bought by Microsoft recently; plenty of sessions on predictive analytics; various Microsoft dev teams will be out in force; and Marco Russo and I will be speaking too. I think it promises to be a great conference, definitely not a PASS BI conference, and very different from the PASS Summit.

You can register here, and using the code BASPCHR will give you a $150 discount.

Written by Chris Webb

February 15, 2015 at 9:56 pm

Posted in PASS

More Power BI Licensing Details

with 2 comments

At the end of last week I came across an interesting link on Twitter (thanks Devin) with more details on how the ‘new’ Power BI will be licensed, and on how existing Power BI subscribers will move to the new experience, which I thought was worth highlighting. Here it is:

http://www.microsoft.com/en-us/powerBI/licensing.aspx?utm_content=bufferfe7e5

Some points to highlight:

  • The Power BI service will become a standalone service and will no longer require SharePoint Online
    This is not exactly a surprise based on what’s been announced so far, but it’s the first time it’s been clearly stated that the ‘new’ Power BI is not dependent on Office 365 and SharePoint Online
  • At general availability of the new experience, existing customers will be asked to move to the new experience
    That’s to say the ‘new’ Power BI is a direct replacement for the old, O365-based Power BI, and that there will only be a single version – the ‘new’ version – of Power BI in the future
  • Power BI will seamlessly work with Office 365 for customers with subscriptions to both
    and
    This move will require configuring the new experience to connect to existing Excel workbooks. Users will need to either point to their Excel workbooks on SharePoint Online or reload these workbooks into Power BI
    If you are an existing Office 365/SharePoint Online/Power BI customer, there will be integration between SharePoint Online and the new Power BI service – so it looks like no functionality is going to be lost in the transition
  • Existing Power BI for Office 365 customers will be able to transition from the existing user experience to the new user experience when it is made available using their existing subscription license…
    This license transition should not interrupt access to the Power BI service.
    So existing customers should be able to move to the ‘new’ Power BI very easily when it is released (remember, we’re currently in preview and the preview of the cloud service is only available to US customers)

If you’re a new customer thinking of signing up for the ‘new’ Power BI, you should check out the licensing options here:
http://www.powerbi.com/dashboards/pricing/

…and also this post on the Power BI support blog on making sure you sign up to the service with the appropriate email address:
http://blogs.technet.com/b/powerbisupport/archive/2015/02/06/power-bi-sign-up-walkthrough.aspx

Written by Chris Webb

February 9, 2015 at 9:30 am

Posted in Power BI

Expression.Evaluate() In Power Query/M

with 2 comments

A year ago I wrote a post on loading the M code for a Power Query query from a text file using the Expression.Evaluate() function, but I admit that at the time I didn’t understand how it worked properly. I’ve now had a chance to look at this function in more detail and thought it might be a good idea to post a few more examples of how it works to add to what’s in the Library spec.

The docs are clear about what Expression.Evaluate() does: it takes some text containing an M expression and evaluates that expression, returning the result. The important thing to remember here is that an M expression can be more than just a single line of code – a Power Query query is in fact a single expression, and that’s why I was able to execute one using Expression.Evaluate() in the blog post referenced above.

Here’s a simple example of Expression.Evaluate():

let
    Source = Expression.Evaluate("""Hello World""")
in
    Source


It returns, of course, the text “Hello World”:

image

Here’s another example of an expression that returns the number 10:

let
    Source = Expression.Evaluate("6+4")
in
    Source


image

Remember that I said that a Power Query query is itself a single M expression? The way Power Query implements multiple steps in each query is by using a let expression, which is a single M expression that contains multiple sub-expressions. Therefore the following example still shows a single expression (consisting of a let expression) being evaluated to return the value 12:

let
    Source = Expression.Evaluate("let a=3, b=4, c=a*b in c")
in
    Source


image


OK, so far not very interesting. What we really want to do is evaluate more complex M expressions.

Consider the following query, which uses the Text.Upper() function and returns the text ABC:

let
    Source = Text.Upper("abc")
in
    Source


If you run the following query you’ll get the error “The name ‘Text.Upper’ does not exist in the current context.”:

let
    Source = Expression.Evaluate("Text.Upper(""abc"")")
in
    Source

image

To get a full understanding of what’s going on here, I recommend you read section “3.3 Environments and Variables” of the language specification document, which is available here. The short explanation is that all M expressions are evaluated in an ‘environment’, where other variables and functions exist and can be referenced. The reason we’re getting an error in the query above is that the expression is being evaluated in an environment all of its own, where the global library functions aren’t available. We can fix this quite easily, though, by specifying the global environment – where the global library of functions that Text.Upper() belongs to is available – in the second parameter of Expression.Evaluate(), using the #shared intrinsic variable, like so:

let
    Source = Expression.Evaluate("Text.Upper(""abc"")", #shared)
in
    Source

image

#shared returns a record containing the names of all of the variables in scope for the global environment. As such, it can be used on its own in a query to return all of the variables available – including all of the functions in the global library and all of the other queries in the current workbook:

let
    Source = #shared
in
    Source


Here’s what that query returns on the workbook that I’m using to write this post, which contains various queries apart from the one above:

image

Reza Rad has a blog post devoted to this which is worth checking out.

Using #shared will allow you to evaluate expressions that use global library functions, but it’s not a magic wand that makes all errors go away. The following query declares a list and then attempts to use Expression.Evaluate() to get the second item in the list:

let
    MyList = {1,2,3,4,5},
    GetSecondNumber = Expression.Evaluate("MyList{1}", #shared)
in
    GetSecondNumber

image

Despite the use of #shared in the second parameter we still get the context error we saw before because the variable MyList is still not available. What you need to do here is to define a record in the second parameter of Expression.Evaluate() so that the environment that the expression evaluates in knows what the variable MyList refers to:

let
    MyList = {1,2,3,4,5},
    GetSecondNumber = Expression.Evaluate("MyList{1}", [MyList=MyList])
in
    GetSecondNumber

image

This slightly more complex example, which gets the nth item from a list, might make what’s going on here a little clearer:

let
    MyList_Outer = {1,2,3,4,5},
    NumberToGet_Outer = 3,
    GetNthNumber = Expression.Evaluate("MyList_Inner{NumberToGet_Inner}", 
       [MyList_Inner=MyList_Outer, NumberToGet_Inner=NumberToGet_Outer ])
in
    GetNthNumber


image

In this example you can see that the two variable names present in the text passed to the first parameter of Expression.Evaluate() are present in the record used in the second parameter, where they are paired up with the two variables from the main query whose values they use.

Finally, how can you pass your own variable names and use functions from the global library? You need to construct a single record containing all the names in #shared plus any others that you need, and you can do that using Record.Combine() to merge a manually created record with the one returned by #shared as shown here:

let
    MyList_Outer = {1,2,3,4,5},
    NumberToGet_Outer = 1,
    RecordOfVariables = 
     [MyList_Inner=MyList_Outer, NumberToGet_Inner=NumberToGet_Outer ],
    RecordOfVariablesAndGlobals = Record.Combine({RecordOfVariables, #shared}),
    GetNthNumber = Expression.Evaluate(
     "List.Reverse(MyList_Inner){NumberToGet_Inner}", 
     RecordOfVariablesAndGlobals )
in
    GetNthNumber


image

Written by Chris Webb

February 6, 2015 at 2:14 pm

Posted in Power Query

If I Could Have New Features In SSAS Multidimensional, What Would They Be?

with 37 comments

Indulge me for a moment, please. Let’s imagine that somewhere in Microsoft, someone is planning for SQL Server v.next and is considering investing in new features for SSAS Multidimensional (don’t laugh – I wouldn’t be writing this post if I didn’t think it was a possibility). What features should they be?

Before I answer that question, it’s worth pointing out that despite what you might think there has been some investment in SSAS Multidimensional over the last few years. This post lists what was new in SSAS 2012 Multidimensional; since then support for DAX queries has been added and, umm, the new Divide() function. This must have been a lot of work for someone – but why does it get overlooked? One reason: none of these changes have made much difference to the ordinary SSAS Multidimensional developer’s life. DAX query support is great if you’re one of the few people that uses the SharePoint version of Power View; shockingly, it still doesn’t work in Excel 2013 Power View yet (though I guess it will be the way the new Power BI connects to on-prem Multidimensional). NUMA support is great if you work for an investment bank and have vast amounts of data and a high-spec server, but that’s only about 0.1% of the installed base.

So from this we can learn that the main consideration when choosing new features to implement should be that they should be relevant to the majority of SSAS Multidimensional developers, otherwise they’ll be ignored and MS may as well have not bothered doing anything. To that we can add these other considerations:

  • These features should provide compelling reasons to upgrade from earlier versions of SSAS to the new version
  • While some features should be available in all editions, there should also be some features that encourage customers to upgrade from Standard Edition to Enterprise Edition
  • There are limited resources (time and developers) available and Power Pivot/SSAS Tabular will be the priority, so only a few features can be delivered.
  • Features that are only there to support Power BI don’t count

With all of that borne in mind, here’s what I would choose to implement based on what I see as a consultant and from the popularity of particular topics on my blog.

Last-Ever Non Empty

One of the most popular posts I’ve ever written – by a gigantic margin – is this one on the last-ever non-empty problem. Given that so many people seem to come up against this, and that the MDX solution is complex and still doesn’t perform brilliantly, I think it should be built into the engine as a new semi-additive aggregation type. Since semi-additive measures are Enterprise Edition only, this would be my sole Enterprise Edition feature.

MDX Calculation Parallelism

Ever since I’ve been working with SSAS, people have always asked why the Formula Engine has been single-threaded. I understand why the SSAS dev team have ignored this question and instead concentrated on tuning specific scenarios: doing parallelism properly would be extremely difficult given the way MDX calculations can be layered over each other, and in plenty of cases it could lead to worse performance, not better. However I’m not asking for a ‘proper’ implementation of parallelism. I just want something dumb: a boolean property that you can set on a calculation that tells the Formula Engine to do this calculation on a separate thread. If it makes performance better then great; if not, then don’t set it. My guess is that even a crude implementation like this could make a gigantic difference to performance on many calculation-heavy cubes.

Drillthrough

Drillthrough is one of those features that almost everyone wants to use, but for some reason has been left in a semi-broken state ever since 2005. Here’s what needs to change:

  • It should work with calculated members. I don’t expect SSAS to understand magically how to work out which rows to display for any given MDX calculation, but I would like a way of specifying in MDX what those rows should be.
  • Those stupid, ugly column names – SSDT should let us specify readable column names and let us have complete control over the order they appear in.
  • Excel should allow drillthrough on multiselect filters.

‘Between’ Relationships

This might seem a bit of a strange choice, and I suspect it may not be easy to implement, but another problem that I come across a lot in my consultancy is the ‘events-in-progress’ problem. I’ve blogged about solving it in MDX and DAX, as have many others. I would love to see a new ‘between’ dimension/measure group relationship type to solve this. In fact, competing OLAP vendor iccube already implemented this and you can see how it works on that platform here and here. My feeling is that this would open up a massive number of modelling opportunities, almost as many as many-to-many relationships.


And that’s it: four features that I think could make SSAS Multidimensional v.next a must-have upgrade. I’m not so naive as to believe that any or all of these will be implemented, or even that we’ll get any new features at all, but who knows? If you have any other suggestions, please leave a comment.

Written by Chris Webb

February 2, 2015 at 11:02 pm

Deprecated/Discontinued Functionality In SSAS 2014

with one comment

Last week while reading Bill Anton’s blog (which is, by the way, highly recommended) I came across a link to a page in Books Online that I hadn’t seen before: a list of deprecated and discontinued functionality in SSAS 2014. Here it is:

https://msdn.microsoft.com/en-us/library/ms143479.aspx

The most interesting point is that the Non_Empty_Behavior property on calculations will not be supported in SSAS v.next. I still see this property being used a lot, and as I show here, if you use it incorrectly it can give you bad results. Although I have seen a few cases where it has been necessary to set Non_Empty_Behavior (for example here), they have been very, very rare and I think deprecating it is the right decision. Other than that, remote partitions, linked dimensions and dimension writeback will also no longer be supported in a ‘future’ version, but I don’t think anyone will be too worried about those features.

Written by Chris Webb

January 27, 2015 at 9:30 am

Posted in Analysis Services
