Chris Webb's BI Blog

Analysis Services, MDX, PowerPivot, DAX and anything BI-related

Storage Engine Cache Aggregation and its Implications for Dimension Design

with one comment

SSAS Multidimensional does caching in a number of places, but the most important type of caching for most cubes happens in the Storage Engine (SE) – the part of SSAS that reads data from disk and aggregates it up to a given granularity. After the SE has answered a single request for data it stores the resulting subcube in cache so that if the same request is made in the future then it can be answered from there very quickly. In addition, in some cases the SE is also able to answer requests at a higher granularity from data that is currently in its cache, and increasing the likelihood of this happening can have a significant positive impact on the performance of large cubes.

Let’s see an example of how SE caching works using a very simple cube built on the Adventure Works database, with one Sum measure and just one dimension, a Date dimension with the following attribute relationships:

image

Running the following query on a cold cache:

select {[Measures].[Sales Amount]} on 0,
[Date].[Year].[Year].members
on 1
from [SECacheDemo]

…and then looking in Profiler shows that the SE has to go to disk to get the data it needs (as shown by the Progress Report Begin/End events):

image

Running the same query immediately afterwards shows the SE can get the data it needs from cache:

image

Running the following query, where I’m getting the All Member from the Year hierarchy, shows that the SE is also able to answer this request from cache:

select {[Measures].[Sales Amount]} on 0,
{[Date].[Year].[All]}
on 1
from [SECacheDemo]

image

This is as you’d expect because, of course, the All Member on the Years hierarchy represents the aggregated total of all the years returned in the first query.

There are several limits on the ability of the SE to derive aggregated totals from data it already has in cache. For a start, the SE cannot aggregate multiple cache entries to derive a single figure. So, for example, if I run the following three queries:

select {[Measures].[Sales Amount]} on 0,
{[Date].[Year].&[2001],[Date].[Year].&[2002]}
on 1
from [SECacheDemo]

select {[Measures].[Sales Amount]} on 0,
{[Date].[Year].&[2003],[Date].[Year].&[2004]}
on 1
from [SECacheDemo]

select {[Measures].[Sales Amount]} on 0,
{[Date].[Year].&[2006]}
on 1
from [SECacheDemo]

…which together return all the years, when I run the query to get the All Member that will not be answered from the SE cache. Each of the three queries above create separate entries in the SE cache; this is one of the reasons why, when cache warming, it’s better to use a few very large queries rather than lots of small, filtered queries.

Furthermore (and this is something that surprised me a little when I found out about it recently), despite the presence of attribute relationships, the SE cannot always work out how to derive higher-level values from lower-level cached data. Running the following query on a cold cache:

select {[Measures].[Sales Amount]} on 0,
[Date].[Date].[Date].members
on 1
from [SECacheDemo]

image

…populates the SE cache with data at the Date granularity, but the following query to get the values for all years:

select {[Measures].[Sales Amount]} on 0,
[Date].[Year].[Year].members
on 1
from [SECacheDemo]

…is not answered from cache, even though the year values could be derived from the date values already in cache.

image

Why is this happening? Well, the SE does not (at least at the time of writing) decode the attribute relationships when working out which granularities can be aggregated from cache. If you look at the granularities as represented in the Query Subcube Verbose events shown in the previous two screenshots, the granularity for the request at Date is
* 0 0 0
…and the granularity for the request at Year is
0 0 0 *
The four characters in this representation of the granularity stands for the four attributes on the dimension: Date, Month, Quarter and Year. The 0 character shows that a request is not at the granularity of that attribute, any other value shows that it is, and the asterisk character shows the request returns all the values at the specified granularity (this white paper gives more detail on how to interpret these values). So, without knowing anything about attribute relationships, the SE can say that the granularity
0 0 0 0
can be aggregated from
* 0 0 0
but it cannot say that
0 0 0 *
can be aggregated from
* 0 0 0

Luckily these limitations on what can be aggregated do not apply to aggregations: if I was to build an aggregation at the Date granularity, my query at the Year granularity would be able to make use of that aggregation.

Also, the use of natural user hierarchies can work around this limitation. Consider the following user hierarchy built on the dimension:

image

Querying at the Date level of this user hierarchy, like so:

select {[Measures].[Sales Amount]} on 0,
[Date].[Calendar].[Date].members
on 1
from [SECacheDemo]

image

…results in a request at the granularity
* * * *
which can then be aggregated up to many more granularities – querying at a level in a natural user hierarchy automatically includes the granularities of all the attributes used for the levels above in the user hierarchy.

Therefore, both the following queries:

select {[Measures].[Sales Amount]} on 0,
[Date].[Calendar].[Year].members
on 1
from [SECacheDemo]

select {[Measures].[Sales Amount]} on 0,
[Date].[Year].[Year].members
on 1
from [SECacheDemo]

…can be answered from the cache created by the query at the [Date].[Calendar].[Date] level.

The conclusion must be, then, that using natural user hierarchies will allow for much better SE cache reuse than using the attribute hierarchies on their own. Also, if you have a natural user hierarchy, it might be a good idea to hide the underlying attribute hierarchies so users and other developers do not reference them in their queries and calculations. You may not notice the performance difference that better SE cache reuse gives you on most cubes, but on very large cubes or cubes that are very SE-intensive (for example, because they are reprocessed frequently) this could make a noticeable difference to your overall query performance.

Thanks to Akshai Mirchandani and Hrvoje Piasevoli for their help in understanding this.

Written by Chris Webb

July 2, 2012 at 2:07 pm

One Response

Subscribe to comments with RSS.

  1. Nice overview, thank you !

    F.L.

    July 3, 2012 at 7:57 am


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 3,302 other followers

%d bloggers like this: