Designing Effective Aggregations in AS2005
Aggregations are, as we all know, the key to getting the best performance out of your cube. In AS2K it was usually enough to run the Aggregation Design Wizard to get a reasonable aggregation design; in my experience with AS2005 it’s much harder to achieve good results – a lot of people run the wizard and find that performance is still poor. So what I thought I’d do is detail the process I go through when I’m designing aggregations…
- The first step in getting good performance is making sure your dimension design is as clean as possible. The commonest thing I see when I do performance tuning on a cube is that there are loads of useless attributes in the dimensions which have been created by the cube design wizard. Let’s take a time dimension as an example: imagine you had a table with four columns, timeid (containing the surrogate key), month, quarter and year, where month is the lowest level of granularity. In a lot of cases the wizard will translate this to four attributes, one based on each column, but of course you don’t need an attribute based on the timeid column – no user is ever going to want to analyse data by a surrogate key value. Since timeid is in fact the key value corresponding to the month name given in the month column, what you should do in this case is delete the month attribute, set the name column property of the timeid attribute to point to the month column and rename the timeid attribute to Month; its key column property will stay pointing to the timeid column. As a result you have one attribute where you used to have two, a lot less complexity in the ‘space’ of your cube which will make designing aggregations easier, and a more efficient design.
- Set the AttributeHierarchyEnabled property of any attributes which you don’t want users to analyse by to False. An example would be an address attribute, which you might want to display as an AS2K-style member property, but you would never expect a user to drag onto a grid or slice by on its own. Again, this reduces the ‘space’ of the cube and makes designing aggregations easier by reducing the number of attributes that need to be considered during design.
- Next, I always see if I can design in one-to-many relationships between attributes where possible. Going back to the time dimension example, it could be that in the quarter column of your dimension table you have values like ‘Q1′ and ‘Q2′ – these tell you what quarter any given month is in, but don’t tell you what year the quarter is in, so there are no one-to-many relationships between year and quarter and quarter and month. If you add a column concatenating the quarter with the year, ie so you get values like ‘Q1 1999′ and ‘Q2 2000′ and build your quarter attribute from the new column, one-to-many relationships will exist; if your users still want to analyse by quarter independently of year then you can still build an attribute from that column too. Quite why this is a good thing will become clear later on.
- Model attribute relationships between attributes where possible. Attribute relationships are, as I’ve said on several occasions in the past, the single most important thing to consider when designing a dimension and because their importance is not flagged anywhere in the UI or BOL I find they’re widely neglected. Setting them correctly is extremely important for aggregation design and also for allowed AS to use aggregations at lower levels of granularity when your query doesn’t hit an aggregation directly. For more information about attribute relationships and how to set them, I recommend taking a look at the TechEd presentation on AS performance I linked to in my last post http://cwebbbi.spaces.live.com/blog/cns!7B84B0F2C239489A!906.entry, BIN 316. You should also read my post on the dangers of modelling redundant attribute relationships: http://cwebbbi.spaces.live.com/Blog/cns!7B84B0F2C239489A!619.entry
- Model all the user hierarchies you think you’ll need. If you’ve set you’re attribute relationships you’ll (hopefully) find that a fair propotion of them are natural hierarchies, ie a one-to-many relationship exists between each level. This is probably a good place to link to Elizabeth Vitt’s excellent post on influencing aggregation candidates: http://www.sqlskills.com/blogs/liz/2006/07/03/InfluencingAggregationCandidates.aspx . Before you read this post any further I suggest you read, learn and inwardly digest her post. The only thing I disagree with her about is possibly a typo – at the end of her post she says "scenario 6 and scenario 2 provide the best solutions from an aggregation candidate point of view" - I think she meant to say scenario 6 and scenario 4, since scenario 4 has attribute relationships set whereas scenario 2 doesn’t.
- Set the AggregationUsage property on each attribute appropriately. Elizabeth’s post above is very good on what this does; the only thing I have to add is that in many cases I find that relying on natural user hierarchies alone to influence aggregation design isn’t enough. For example, I find that on a Time dimension users almost always analyse data at the month attribute level even when higher levels of granularity on that dimension exist – in these cases I set the AggregationUsage property to Full for the month attribute, and this ensures that every aggregation designed by a wizard includes the month attribute. If you don’t do this then you might find you run the wizard and no aggregations are built at month, and so when your users start querying they get no benefit from the aggregations you’ve built. Similarly, I find it’s a good idea to set AggregationUsage to Unrestricted on any attribute hierarchies you’re exposing to users which you expect to be queried extensively, and setting AggregationUsage to None on any attributes which are part of user hierarchies (especially at higher levels) but which you nevertheless don’t expect to be queried that much.
- Run the Aggregation Design Wizard. What I usually do before this is to deploy and fully process my cube with no aggregations; you can then run the wizard and design aggregations, then process the aggregations separately by doing a Process Index. This way you can see what the performance of your cube is like before aggregations have been built and get a better feel for how much of an impact they’re making on performance; you can then throw your aggregations away, design some new ones and do another Process Index (which is generally fairly quick) if you need to. When you run the wizard, on the ‘Specify Object Counts’ step, where possible click count to make sure that the values in the Estimated Count column are as up-to-date as possible. More importantly, if you’re partitioning your cube, you need to make sure that the values in the Partition Count column reflect the count of the numbers of members on each attribute that will exist in each partition: for example, if you’re partitioning by month, you will want to tell AS that there’s only one month in each partition. These counts are, I believe, used during the aggregation design process to estimate the size of an aggregation, and if you don’t specify correct values then the algorithm may incorrectly think a useful aggregation is much bigger than it actually would be and not build it as a result. On the ‘Set Aggregation Options’ step of the wizard the rule was in AS2K to choose the ‘Performance Gain Reaches’ option and set it to stop at 30%. This is still good advice but more and more in AS2005 I find myself using the ‘Estimated Storage Reaches’ option instead: in an AS2005 cube there are a lot of very small aggregations that can be built, and by stopping at 30% Performance Gain you may find that you have still only built a few Kb of aggregations. If you select the ‘I click Stop’ option and watch the design grow until the estimated size is ridiculously large (maybe over a couple of Gb) you can then get a feeling for how many small aggregations can be built; you can then stop it, reset the aggregations and then restart using either the ‘Performance Gain’ or ‘Storage Reaches’ option set to an appropriate level.
- Design aggregations manually if necessary. In AS2K you only needed to design aggregations manually in very rare cases, but I find myself doing it much more frequently in AS2005 typically when I have a particular query or RS Report which needs tuning. There is apparently a tool in the pipeline which will help with this task (I’ve heard several people mention it on the newsgroup) but I’ve not seen it yet and I don’t know when it will be released; in the meantime designing aggregations manually means hacking the XMLA definition of a measure group. When you design aggregations using one of the wizards, the end result is an ‘aggregation design’ which gets saved on the measure group; each partition then has a property which refers to the id of the aggregation design on the measure group its assocated with. In SQL Management Studio, to see an aggregation design, you need to connect to AS in the Solution Explorer pane, expand your cube so you can see the measure groups in it then right-click on a measure group and select ‘Script Measure Group as/ALTER to/New Query Editor Window’. In the resulting XMLA if you collapse the Measures, Dimensions and Partitions nodes you should be able to see the AggregationDesign node; if you collapse the Dimensions node underneath it you’ll see an Aggregations collection containing the list of aggregations in the aggregation design. It’s pretty easy to work out what’s going on: each aggregation has a name and id which by default is a number in hexadecimal, and it consists of a collection of dimensions. If there’s an aggregation designed which is at a lower granularity than the root of the dimension then there’ll be an attribtues collection, inside which will be one or more attribute objects. An aggregation is just fact data summarised up to a certain level, and when you see an attribute mentioned in an aggregation definition you know that the aggregation contains data summarised up to that level of granularity. To design your own aggregations manually all you need to do is copy the last aggregation in the list, paste it into the definition at the end of the list, update its id and name properties and add or remove attributes to set its granularity; to determine what granularity you need to build your aggregation at you’ll find it useful to run a Profiler trace and look at the Query Subcube and Query Subcube Verbose events which Mosha describes in more detail here: http://sqljunkies.com/WebLog/mosha/archive/2006/01/05/cache_prefetching.aspx. Assuming you’re editing an ALTER TO script you can then just run the command in SQLMS and run a Process Index to get all your aggregations, including the new one, built.
- Run the Usage Based Optimisation Wizard. In theory, Usage-Based Optimisation should be the icing on the cake as far as performance tuning goes. All you should need to do is turn on the query log (right-click on the server name in SQLMS and look at the server properties to do this) so that the queries your users are actually running are captured, then you just run the wizard, select which queries you want tuned and then extra aggregations are built appropriately. This is how it worked in AS2K, but the other week I made the discovery that in AS2005 when you run the wizard it overwrites any aggregations you have previously designed on the cube without giving you the option to add to the ones you’ve already got, so if you’re not careful you could end up tuning one set of queries and find that other ones start running slower. To stop this happening what you’d need to do is script your measure groups before running the wizard to save your existing aggregations, then after the wizard has completed try to merge the two sets of aggregations together manually. This isn’t exactly ideal so I’ve opened a Connect to try to get it changed: https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=221844 Feel free to vote on this issue to get it fixed!
- Check to see whether aggregations are actually being used by your queries. You can do this very easily in Profiler that includes the ‘Get Data From Aggregation’ event (found under ‘Query Processing’ and only visible if you check ‘Show all events’). It’s worth remembering that even when your query hits an aggregation directly it will still perform poorly – there are a lot of other factors in play such as calculations which can affect query performance.
So anyway, there are my thoughts on designing aggregations. If I’ve forgotten anything or got anything wrong, or you’ve got any of your own tips to add, please leave me a comment! Hopefully the AS2005 version of the "Analysis Services Performance Guide" which I understand is coming soon will contain all of this information and more, but in the meantime I hope you find it useful.
UPDATE: James Snape has a useful blog post describing what all those 1s and 0s mean in Profiler – something I probably should have explained myself: