Archive for September 2011
I’ve spent the last week working with a fairly large cube – several fact tables have hundreds of millions of rows in – and, as always on an SSAS project, there are occasions when you have no choice but to reprocess the whole cube. Sitting round waiting for processing to complete can be pretty tedious and, while I’m sure we’ve all worked out how to plan our work so that processing can take place overnight, during lunch or a meeting, it’s always good to know some tricks to make processing faster.
There’s a good set of tips on how to optimise processing in the SSAS Operations Guide, but this post is about something not mentioned in that white paper because it’s a scenario that shouldn’t happen in production – although it’s very common in development. In my case I’ve been building a cube on top of a half-finished data warehouse where the data is still quite dirty. That means that during processing there are lots of key errors, and while they get fixed I’m handling them in my processing by selecting the ‘Convert to Unknown’ option. What I’ve half-known for a long time and only properly investigated this week, though, is the impact that reporting these key errors can have on processing times.
Here’s an extreme example using Adventure Works. I’ve created a new cube based on the FactInternetSales table and added the Customer dimension to it, but deliberately joined the Customer dimension on to the Order Date column in FactInternetSales to simulate a lot of key errors.
If I then do a ProcessFull with the following error configuration:
ie with Key Error Action set to Convert To Unknown, Ignore Errors Count selected, and Key Not Found and Null Key Not Allowed both set to Report And Continue, when I do a full process all the errors are logged in the UI and it takes around 23 seconds to finish:
If, on the other hand, you set Key Not Found and Null Key Not Allowed to Ignore Error, as follows:
…a full process takes under a second:
Now this is an extreme example, of course, and in a production system you should not have any key errors anyway, and if you do you definitely do not want to ignore these key errors. But if you’re developing a cube and you know that key errors exist but you don’t care about them (for the moment), you can save a massive amount of time by not reporting these errors. In my case cube processing went down from around 3 hours to 45 minutes. Which is a good thing if you want to get on with your work faster, but I suppose is a bad thing if it interrupts your web browsing/coffee drinking/chatting or whatever you do to pass the time while processing’s taking place.
I’m very proud to say that I’ve been made an Analysis Services Maestro! You can read the official announcement here:
If you’re wondering what this actually means, you can find out more here; in a nutshell, it’s something like a MCM for Analysis Services. The course itself was pretty tough and there was a lot of work involved in doing the labs, exam and case study, but I learned a lot (especially from Thomas Kejser) and it was definitely a worthwhile thing to do.
Congratulations also go to the other ten awardees: Ulf, Alberto, Greg, Darren, Dirk, Mick, Vidas, Marco, Harlan and John.
There’s an interesting new DMV available for PowerPivot or the Tabular model in Denali called DISCOVER_CALC_DEPENDENCY that allows you to list all the dependencies between objects in your model. The full documentation is here:
…but I thought it would be good to blog about because some of the practical uses of it are not explored in BOL.
For the purposes of illustration, I created a simple Tabular model with two tables, a relationship between the tables, some calculated columns and some measures. In SQL Management Studio I then connected to this database with an MDX query window and ran the following query:
select * from $system.discover_calc_dependency
Here’s what I got back:
OK, so there’s a lot going on here and unless you’ve got really good eyesight you won’t be able to make out much of this. I’d like to draw your attention to the second column though, where you can see the types of objects we can see dependencies for: Measures, Hierarchies, Calculated Columns, and Relationships. We can filter by this column, for example using a query like this:
select distinct [table], [object], [expression] from $system.discover_calc_dependency
(None of that SystemRestrictSchema rubbish needed, thank goodness) This just returns the measures in the model, and is probably the most interesting thing we can get from this DMV. Here are the results of this query:
This then shows us a list of the three measures in our model, what table they’re on, and the DAX expression behind them. Pretty useful. Even better, though, if one measure depends on another measure or calculated column, you can find the related object and its expression too. In this case [Sum of Sales After Tax Times 2] is a measure that sums the results of a calculated column, as the following query shows:
select referenced_object_type, referenced_table, referenced_object, referenced_expression
where [object]=’Sum of Sales After Tax Times 2′
(I’m not sure where that dependency on RowNumber is coming from, though…)
So this is all very useful for you as a developer, for documentation and so on. But wouldn’t it be useful if your users could see all this too? Well, they can, using a technique very similar to the one I blogged about here. Back in BIDS, after the initial deployment of the database, I added a new Analysis Services connection pointing to the Analysis Services database I’d just deployed – so the SSAS database was using itself as a datasource. I was then able to use the first query above,
select * from $system.discover_calc_dependency
To populate a table inside my existing model:
After the model had been deployed again, this meant I could browse the results of the DMV using an Excel Pivot Table:
I’m sure in the future many users, especially if they’re PowerPivot users feeling a little frustrated at the lack of control they have over the Tabular model you’ve built, will be very interested in seeing these formulas so they can understand how they work and reuse them in their own models. And hopefully in the long run the information returned by this DMV will make not only importing data from Tabular models back into PowerPivot much easier, but also make importing parts of existing Tabular models into new PowerPivot models much easier.
I think just about anyone who’s tried to use the Earlier() function in DAX has come across the following error message:
EARLIER/EARLIEST refers to an earlier row context which doesn’t exist
As the documentation helpfully points out:
EARLIER succeeds if there is a row context prior to the beginning of the table scan. Otherwise it returns an error.
Now I’m not going to try to explain what “row context” actually means right here in any detail (this article is a great place to start, and Marco and Alberto do a great job in chapter 6 of their PowerPivot book), although it is a fundamental concept in DAX that you do need to understand thoroughly, but the presence or otherwise of a row context is particularly important when using Earlier().
Take the following model with a single table sourced from the Adventure Works DimDate table. If you create a calculated column using the following expression:
=calculate(countrows(DimDate), All(DimDate), DimDate[DayNumberOfWeek]=earlier(DimDate[DayNumberOfWeek]))
It works fine, as shown below:
For each row, we’re saying that we want to do another scan of the table and find out how many rows in the entire table have the same value in the DayNumberOfWeek column as the current row. Earlier() needs at least two table scans to work and we have that here: one which is inherent in the nature of creating a calculated column, and one that is triggered by the Calculate statement.
However, if you try to use exactly the same expression in a measure, you get the error above:
This is because when a measure is evaluated any external row context is automatically turned into a filter context, so in this case there’s only one row context – that created by the Calculate statement itself.
Anyway, this is all well-documented stuff and something we should all know (though, I guess like everyone else, I’m still relatively new to DAX and forget these things from time to time…). Recently I was playing around with DAX queries and found something I really couldn’t understand. Using the example model above, I found that while creating a measure in a query failed as I would expect:
measure dimDate[Demo] =
I found the following query, where the same DAX expression was used in AddColumns(), worked:
What I came to realise after much patient explanation on the part of Jeffrey Wang and Marius Dumitru was the fairly simple fact that AddColumns() is not the same as creating a measure – in fact it’s like the first example above in that it behaves as if you are creating a new calculated column on the table returned by the query, and indeed we get the same values returned as we get in the calculated column example above. Therefore, with AddColumns(), we have the two row contexts we require to make Earlier() work.
One widely-acknowledged weak spot of Analysis Services and MDX today is that, while the performance of queries showing a small number of rows and aggregated values is usually very good, performance can be bad when you’re reporting on detail-level data and returning a potentially large number of rows. This is due to a number of different reasons (such as the often poor MDX generated by tools like Excel, the fact you have to crossjoin many hierarchies together and then do a NON EMPTY for this kind of report, the fact that you may need to use ROLAP storage for very large hierarchies, and so on), and although matters have improved over the years and there are workarounds for many of these issues, I’d say it’s one of the main reasons that the concept of the Unified Dimensional Model (ie ‘one reporting model to rule them all’) never caught on – some types of reports were always going to be easier to write and faster to execute in SQL than in MDX.
Of course now we have UDM v2.0 in the shape of the ‘BI Semantic Model’ and the Tabular Model the question of detail-level reporting rears its head again. We won’t be able to say whether performance for these queries is good enough in the Tabular Model until RTM because we can expect a lot of tuning and bug-fixing will go on between now and then, but several things already suggest that DAX has several advantages over MDX for detail reporting:
- Anecdotal evidence suggests that detail-level queries going against the same Tabular model are faster when written in DAX than in MDX.
- DAX as a language is much better suited to detail-level reporting: as I showed in my recent series on DAX queries, filtering, subtotalling and pagination are all much easier in DAX than in MDX.
- DAX queries return columns that are typed, whereas the columns that are returned from an MDX query are usually interpreted as text, strings or variants. This makes DAX queries much easier to work with in tools like SSRS where you might want to do calculations or filtering in the reporting tool rather than the query.
However, out of all Microsoft BI tools only Crescent will natively speak DAX when Denali is released – Excel pivot tables will still, of course, be generating MDX in the background and I guess there won’t be a new DAX query builder for traditional SSRS reports as far as I know. That doesn’t mean we can’t use DAX in Excel and SSRS though…
The tricks discussed here can be used to create an Excel table bound to a DAX query; in particular it’s very easy to use the method Greg Galloway came up with of creating a pivot table, double-clicking on a cell to drillthrough, then editing the query behind the new table to use a DAX query. Parameterisation requires a bit of VBA coding but that’s not difficult to do; and of course in many cases it may be sufficient to write a non-parameterised query, bind it to a table and let the native Excel functionality do the rest.
For SSRS, you have two options. You can use the Analysis Services connection type but in CTP3 at least you can’t enter a DAX query in the MDX editor – you get an error. Greg came to my rescue once again though by pointing out that you can get DAX queries to run by clicking the “Command Type DMX” button in the query designer (in the same method you need to use to get an MDX Drillthrough statement to run):
The only drawback with this approach is that the columns in the dataset don’t come back typed in SSRS. The alternative is to use an OLEDB connection to Analysis Services instead (see here for how to do this for an MDX query – the steps are the same for using a DAX query in an OLEDB dataset). If you do this the columns do come back typed but you can no longer parameterise your query directly; you need to generate your DAX query using a dynamic expression if you want to parameterise it which is a royal pain to do.
So not ideal but I think for SSRS at least the benefits of using DAX for reporting outweigh the disadvantages. Even for those people for whom Crescent is not an option because of the Sharepoint dependency, but who want to do a mixture of traditional ad-hoc analysis and detail-level reporting, I suspect that using the Tabular model and DAX plus Excel and SSRS might be a better choice than the Multidimensional model in many cases.
I’m pretty busy this autumn speaking at conferences and running public courses, so I thought I’d list some of the dates here to boost attendance. Hopefully I see some of you there!
First of all, along with Allan Mitchell I’ll be speaking at the Isle of Man BCS in Douglas (at the Tynwald, no less) on September 13th:
Next up there is of course SQLBits in Liverpool, 29th September to 1st October. I’ll be doing a session on ‘Basic BISM’ on the Friday:
Then, on Tuesday 4th October, I’ll be speaking at the Herts SQL Server User Group at Hatfield, as part of SQL Relay:
Then it’s off to PASS in Seattle. I’m not doing a full session this year, hohum, but I did manage to get a slot doing a Lightning Talk:
The week after that I’m in Sweden, running a public SSAS cube design course in Stockholm on October 20th:
Then the week after that it’s my public ‘Introduction to MDX’ course in London (book soon to avoid disappointment):
On November 8th-9th I’m back in Sweden for SQL Rally Nordic:
Finally, I’ll be in Belgium for the the SQL Server Days conference on the 14th-15th November: