Building Aggregations Manually
I’ll start with two disclaimers:
- This entry is based on my own understanding of how the AS engine works, based on my own experience and the occasional question I’ve asked the dev team. It might not be totally accurate on the details but hopefully the general principles are sound – I’ve used them successfully on a number of occasions.
- 99% of the time the Storage Design Wizard and Usage Based Optimisation Wizard will design the right aggregations for you, and you won’t need to build aggregations manually.
So what about the 1% of times when you will need these techniques? They usually occur when you have several dimensions which have large numbers of members at their lower levels, and queries which retrieve data at these levels. Cubes with large parent-child dimensions and dimensions containing entities like customers are the most common real-life manifestations of this scenario. The 1/3 rule, which dictates which aggregations can be built by the Storage Design Wizard and the Usage Based Optimisation Wizard (see the Analysis Services Performance Guide for more details), may have stopped any aggregations which would be useful for your queries being built and so query performance suffers as a result. Now the 1/3 rule is there for a good reason – to stop large aggregations being built which increase processing time and the size of the cube, but don’t have much of an impact on query performance – but it is not infallible; equally, when faced with query performance problems very often you are glad of any improvement in response times, however small!
The first step to take when designing aggregations manually is to understand what data the queries you want to run are actually requesting. The best way to do this is to take a representative sample of these queries, clear the Query Log, set it to log every query and then run the queries one by one. You’ll notice that one MDX query doesn’t necessarily equate to one query in the log – in fact, one MDX query can generate multiple subqueries, even thousands, depending on its complexity; you’ll also notice that what is stored in the query log is somewhat cryptic. Rather than explain it here, I suggest you read the following white paper which explains the contents in some detail. The contents of the DataSet column represent the slice of the cube that each query in the log asked for data from; the same values can be viewed in Performance Monitor using the DSN Requested counter. The other useful value to monitor is the slice of the cube that the AS engine queried to get data to service each of these requests, and unfortunately this is not present in the Query Log – you can only see it in PerfMon using the DSN Used counter.
This is probably a good place to stop and give some examples. Imagine you have a cube with 4 dimensions apart from Measures, and these dimensions have respectively 5, 6, 7 and 8 levels each. If you ran a query requesting values at the top level of each of these dimensions, DSN Requested would show it as 1111 (top, top, top, top); similarly, if you ran a query requesting data from the leaf levels of each dimension, DSN Requested would show it as 5678 (leaf, leaf, leaf, leaf). Now if you had no aggregations in your cube and you ran the first of these queries, 1111, in order to get the value returned AS would have to aggregate values from the leaf member of every single dimension and DSN Used would show 5678; the fact that all this aggregation had to happen at runtime would mean that the query might not run very quickly. If, however, you had an aggregation built at the third level of each dimension and you ran the same query, DSN Used would instead show 3333, and because AS only had to aggregate the members on the third level of every dimension the query would run much faster. Following on from this, imagine that you wanted to run a query that took values from the top levels of the first two dimensions and the leaf levels of the last two dimensions, so that DSN Requested would be 1178. Since it is unlikely that any aggregations would have been built at the bottom level of two such deep dimensions (especially if they had large numbers of members on, for example if one was a Customer dimension), then DSN Used would have to be 5678 and AS would still have to aggregate lots of values at runtime.
Back looking at your Query Log and PerfMon, the next step is to decide whether you need to build aggregations manually. If you’ve been through the Storage Design Wizard and the Usage Based Optimisation Wizard and you’ve set the Aggregation Usage property on all your dimension appropriately (again, see the AS Performance Guide for more information on this), and you still see that your queries are not hitting aggregations (so that there are big differences between the DSN Requested and the DSN Used values) then you probably need to. On the other hand, if you see DSN Requested and DSN Used values which are the same or almost the same, building more aggregations might not do any good and you might need to look at other ways of improving performance, such as partitioning.
The tool you’ll need in order to build aggregations manually is ‘Partition Manager’, which is available in the SQL 2K Resource Kit and also in an updated form in the BI Accelerator, a free download from the Microsoft website. The UI can be a bit fiddly when you have more than a few dimensions but it beats writing the code in DSO.
The final problem is knowing which aggregations you need to build. Going back to the example above, if you saw the pattern 1178 in DSN Requested you could build an aggregation which matched this exactly, ie at the top levels of the first two dimensions and the leaf levels of the last two. This would do the trick, but if your users wanted to drill down on any of the first two dimensions then AS could no longer use it. Therefore it’s a much better idea to build aggregations at the lowest level your users are likely to drill down to, perhaps 3378, and sacrifice a little performance on some higher level queries for much better performance on queries which go slightly deeper. Overall, though, it’s only through trial and error that you’ll work out which aggregations you need.