Archive for December 2007
It’s getting close to the deadline to submit a session for SQLits II next year. I know what you’re thinking, you’d like to attend but don’t have the time to prepare a presentation. Aren’t you tired of hearing the same old people (eg me) present though? Wouldn’t you like to share your hard-won knowledge with the community? Get your 45 minutes of fame and adoration? Of course you would. So why not go to
…and submit something right now? I’d especially like to see some more BI sessions.
BTW, if you’re on Facebook you can join the SQLBits group to stay up-to-date with all the latest news and meet up with other people who’ll be attending.
Recently I was working on a project where the customer wanted to upgrade from SP1 to SP2 but found that when they did so the results of some of the calculations on their cube changed. Unfortunately these changes weren’t easy to spot: their MDX Script was >1500 lines long and the bugs they were running into were dependent on the query being run and whether the cache was cold or warm (in fact one of them was the bug I blogged about here). When testing various workarounds and hotfix builds, the problem I ran into was a problem I’ve run into before on numerous occasions: when I changed something on the MDX Script, how did I know it didn’t break something else?
In the past what I’ve done was take a few MDX queries, run them, copy and paste the results into Excel, make my changes and then rerun the same queries and paste the new results into Excel again and use Excel formulas to compare the old and new resultsets. However in this case this approach just wasn’t feasible and I started looking for a better way of comparing large numbers of MDX resultsets. Now there is definitely a tool to be written here, probably one which runs batches of queries, serialises the cellsets and saves them to disk, and then compares the cellsets and highlights any cells with differences, but I didn’t have time to write it. I was talking the problem over with Didier Simon of PSS, though, and he made a great suggestion – use tablediff.exe, the command line utility you get with SQL Server to compare relational tables. You can read more about it here:
This turned out to be the key. I cannibalised bits of my Analysis Services cache warmer and created two SSIS packages, the first of which ran a batch of MDX queries retrieved from a relational table (originally saved from a Profiler trace) through SQL Server using OpenRowset and saving the results in a table using a SELECT INTO. Here’s what the control flow looked like:
And here’s an example of the kind of SQL query I was generating:
select identity(int, 1,1) as idcolumn, * into Query1
from openrowset(‘MSOLAP’, ‘Datasource=localhost; Initial Catalog=Adventure Works DW’,
‘ with cell calculation test for ”(*)” as ”cstr(round(measures.currentmember,5))”
select Measures.[Internet Sales Amount] on 0,
[Date].[Calendar Year].Members on 1
from [Adventure Works]‘)
A couple of things to note – Tablediff needs a column which can uniquely identify each row in the tables its comparing, hence the identity column. I also found that I was running into the precision problems described here and as a result, tablediff was flagging up cells as having different values when in fact the values were differing only by a miniscule amount, so I created a calculated cell assignment in the WITH clause which rounded up every value to five decimal places. I could do this because I knew none of my queries had WITH clauses, but another solution would have been to create a session calculated cell assignment which did the same thing.
After running this package I had a SQL Server database with as many tables containing results as I had queries. I could then make my changes to the cube and rerun the same package to dump a new set of results to a different database, and then run my second package to compare the results. This was, again, a fairly simple package to put together:
All I did was use a ForEach loop to loop over the tables in the database containing the first set of results, and then use an Execute Process task to call tablediff to compare each table with the equivalent table in the second database; if it found any, I used the -et argument to save the output to another table. I also used the -c argument to get column-level differences and the -b argument to specify the number of bytes to compare for large objects, necessary because the SELECT INTO creates an ntext column for all the member names on the Rows axis.
Overall, it’s not an ideal solution (coding a proper app is definitely the way to go) but it did the job, and hopefully this will be useful to someone else…
Something that’s easy to forget (at least I’m always forgetting it) when creating an aggregation design is the impact that partitioning can have on it. Aggregations are only created on a per-partition basis – you can’t aggregate across partitions – so that means, for example, when you’re partitioning by Month then there’s little point building aggregations at granularities higher than Month; after all, if you only have one Month of data in your partition and you aggregate that data to the Month, Quarter or Year levels then the figures for each aggregation will be identical! In fact building aggregations above the granularity of Month can be a bit of a waste: an aggregation at Month granularity can be used by queries at Quarter and Year, but an aggregation for the same partition at Year granularity that is in all other respects identical can only be used by queries at the Year granularity.
The best thing to do to ensure this doesn’t happen is to set the AggregationUsage property to None for all attributes above the granularity that you’re partitioning by, and also to make sure your partition row counts are set correctly, before you run either of the Aggregation Design wizards. This should ensure that your aggregation design properly reflects your partitioning strategy.
One of the questions I’ve always meant to research further on this topic is whether partitioning could hurt query performance, given that it effectively prevents certain aggregations being built. Taking the example we’ve been using so far, what happens if you’re partitioning by Month and you’ve got a slow query at Year granularity that you’d like to build an aggregation for – would partitioning by Year rather than Month, so you could build a true Year-level aggregation, make sense? Well, I’ve just had a chance to test this out on a fairly large cube (with approx 100 million rows in the main fact table, and Product and Customer dimensions with 1 million+ members on their key attributes) and interestingly partitioning seems to make no difference at all. I created two measure groups, one partitioned by Month and one partitioned by Year, and then created one aggregation on the former at Month/Product and one aggregation on the latter at Year/Product. I then ran a query that returned the top 10 Products at Year on each measure group and the performance was identical. Clearly this is not something to be worried about, thank goodness…
[Thanks are due to Eric Jacobsen of the SQLCat team for talking through some of these issues after the BI Conference earlier this year. He accurately predicted that partitioning wouldn't hurt performance because the AS query engine is very efficient at merging the data caches of partitions.]