One Cube vs Multiple Cubes
One of the questions discussed in the book that Marco, Alberto and I wrote last year, “Expert Cube Development with SSAS 2008” (available in all good bookshops, folks!) was whether, if you have multiple fact tables, you should create one big cube with multiple measure groups or multiple cubes each with a single measure group. While I still stand by what we wrote then, I recently took part in an interesting debate on this subject in the MSDN Forum with Akshai Mirchandani from the dev team about the pros and cons of each approach where some interesting new details came to light:
Here are the main points you need to consider when deciding whether to use the single cube approach or the multiple cube approach. In favour of the multiple cube approach:
- Having multiple, smaller cubes may result in faster query performance than one large cube in some cases, especially if your fact tables have very different dimensionality. This was true in SSAS 2005, and while it’s less obvious in 2008 it’s apparently still there. This was what I’d previously not been sure about: I’d heard rumours about this, and seen it happen in some cases myself with 2005 – although in other cases when I’d tested this out I’d seen no difference in performance – and I wasn’t sure what the situation was with 2008. You’d need to test the two approaches yourself on your cubes and queries to be sure. Here’s what Akshai says on the matter:
if you keep adding lots of dimensions to a cube, then the virtual space of the cube grows — it does not add to the storage cost, but it does hurt formula engine performance in some scenarios because the cell coordinates are based on the number of attributes in the cube space. Increasing the number of attributes in the cube space will start costing performance in lots of small ways and result in performance regressions. Adding lots of unrelated measure groups would result in you adding lots unrelated dimensions to the cube space and cause a performance slowdown — if you had 10 measure groups and they all shared lots of common dimensions, then one cube makes the most sense…
…As I already explained… it affects the sizes of data structures inside the formula engine that are based on the number of attributes in the cube space. When those data structures get larger, there is an incremental cost that can add up (depending on your calculations and query patterns).
For example, you see the Query Subcube Verbose events in Profiler — they show you the subcubes that are used for accessing measure groups. There are similar subcubes that are used for calculating formulas and cell coordinates — all those subcubes get wider and wider as you start adding more attributes into the cube. The cost of accessing and indexing those data structures is what we’re talking about here. If adding new measure groups doesn’t require adding new attributes/dimensions, then there is no problem…
We had measured the difference before 2005 shipped for some real customer cubes and found there there was a noticeable performance improvement to split up into multiple cubes…
- While it is possible to apply dimension security to the Measures dimension, it is much easier to allow or deny access to a cube with the multiple cube approach than it is to apply security to all the measures in a measure group using the single cube approach.
- Having multiple, simpler cubes can be much more user friendly than one monster cube with loads of dimensions and measure groups. If you have Enterprise Edition you can of course use Perspectives to counter this, but if you are using Standard Edition then Perspectives aren’t available.
- Maintenance can be easier and less disruptive with multiple cubes: if you need to make changes to a cube while users are querying it, you might end up invalidating users’ connections and dropping caches. With one cube the chances of this disruption affecting more users increases.
- It’s easier to scale out with multiple cubes: if you find your server is maxing out, you can simply buy another server and distribute your cubes equally between the two. With a single cube approach you end up having to look at (admittedly not that much) more complex scale-out scenarios like network load balancing.
On the other side, here are the arguments in favour of the single cube approach:
- If you ever need to work with data from two fact tables in the same query or calculation, or if you think you might ever need to in the future, you should go with the single cube approach. The two options for cross-cube querying, linked measure groups and the LookUpCube MDX function, should be avoided. Linked measure groups are a pain to manage, carry a slight query performance overhead, and can result in the same MDX calculations being duplicated across the original cube and the cube containing the linked measure group (which means maintenance becomes more difficult). The LookUpCube function is probably the worst MDX function to use in a calculation from a performance point of view and should be avoided at all costs. So a single cube is the only feasible option.
- Even if your users tell you they will not ever need to analyse data from two fact tables in the same query, be prepared for them to change their minds. In my experience, SSAS projects have a tendency to grow in complexity over time, and cubes that start out simple in a first release often grow lots of new functionality as time goes on – and the more successful the project, the quicker things get complicated. As soon as your users see what’s possible with SSAS they will start to have new, more ambitious ideas about the kind of analysis they want to do with their data, and it’s very likely that they will realise they do need to do cross-measure-group queries and calculations. If you started out on the multiple cube approach and then this happens you will have no choice but to use linked measure groups, and as I said this can make maintenance difficult; using the single-cube approach from the start means you won’t have this problem.
My personal preference is to use the single cube approach by default, and then move to multiple cubes if there are pressing reasons to do so, for example if query performance is a problem. This might seem a bit strange given the number of reasons I’ve given for the multiple cube approach, but frankly the need to support cross-measure-group querying and calculations trumps them all. As I said, if you need to do it (and 99% of the time you will), or you even half suspect you might need to do it sometime in the future, you have to go with the single cube approach. That said, I know other people are more inclined to the multiple cube approach than I am and to a certain extent it’s a matter of taste.