Chris Webb's BI Blog

Analysis Services, MDX, PowerPivot, DAX and anything BI-related

Archive for February 2008

Using Non_Empty_Behavior With YTD calculations

with 3 comments

I always believe in giving credit where credit’s due, and before I go on with this blog entry I have to declare that this isn’t my idea. I heard about the general approach from Eric Jacobsen (or possibly Richard Tkachuk) at the BI Conference last year, and he said it originally came from Thomas Kejser, who is now on the SQLCAT team; I’ll be working with him on a project fairly soon, so I’ll ask him about it! I’ve not actually seen it implemented or written about anywhere though, and since I’ve only just got around to testing it out and seeing how useful it can be, I thought I’d blog about it.

Let’s start by taking a look at the following Adventure Works query:

with member measures.ytdsales as
sum(
periodstodate([Date].[Calendar].[Calendar Year],[Date].[Calendar].currentmember)
, [Measures].[Internet Sales Amount])

select {[Date].[Calendar].[Month].&[2003]&[12].children} on 0,
non empty
[Customer].[Customer Geography].[Customer].members
*
[Promotion].[Promotion].[Promotion].members
on 1
from [adventure works]
where(measures.ytdsales, [Product].[Subcategory].&[2])

It runs in 1 minute 10 seconds on my laptop and returns 33 columns and 2585 rows. YTD calculations are now very efficient in AS2005 SP2, and as far as I can see the real cause of the problem is the extremely large crossjoin on rows and the associated NON EMPTY. Relatively few customers actually bought anything in this particular subcategory in 2003, and when they did it was usually with no discount, so there are a lot of empty tuples that we need to remove from rows; but in order to filter them out, at the moment, we have to calculate the YTD sales for every single one of them. Now, if we had control over the MDX of this query we could probably do some clever stuff with the NonEmpty function that would improve performance a lot, but what if users are querying with an ad hoc tool like Excel or ProClarity? It would be nice if we could set the Non_Empty_Behavior property on this calculated member, but that involves having a real measure on the cube whose value is null whenever YTD sales is null, and we don’t have one of those…
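Just to illustrate the kind of thing I mean, here’s a sketch of how the rows axis could be pre-filtered with the NonEmpty function against the real Internet Sales Amount measure over the year to date, so the calculation only gets evaluated for combinations that actually had some sales in 2003 (this is only a sketch, and I haven’t timed it against the original query):

with member measures.ytdsales as
sum(
periodstodate([Date].[Calendar].[Calendar Year],[Date].[Calendar].currentmember)
, [Measures].[Internet Sales Amount])
select {[Date].[Calendar].[Month].&[2003]&[12].children} on 0,
//pre-filter the crossjoin against the real measure over the whole of the year to date
nonempty(
[Customer].[Customer Geography].[Customer].members
*
[Promotion].[Promotion].[Promotion].members
, {[Measures].[Internet Sales Amount]}
*
periodstodate([Date].[Calendar].[Calendar Year],[Date].[Calendar].[Month].&[2003]&[12])
) on 1
from [adventure works]
where(measures.ytdsales, [Product].[Subcategory].&[2])

Of course, with an ad hoc tool we don’t get to rewrite the query like this, which is why we’re back to needing a real measure to use for Non_Empty_Behavior.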

So why don’t we build one? In fact it’s much easier to do this than you might think, especially when you remember that we’re not interested in the value of this measure at all, just whether it’s null or not. What we need to do is take our existing fact table, do a group by to find the minimum date for every key combination, and then generate a row in the new fact table for every period from that date to the end of the year. So, to take a simplified example: if you had a fact table with two dimensions, Product and Time, and the granularity of Time was Month, you would start with this:

Product    Time        Sales
Apples     October     100
Oranges    November    150

and want to get to this:

Product    Time
Apples     October
Apples     November
Apples     December
Oranges    November
Oranges    December

Here’s a SQL query which does just that for the FactInternetSales fact table in Adventure Works DW:

select
f.ProductKey, f.DueDateKey, f.ShipDateKey,
f.CustomerKey, f.PromotionKey, f.CurrencyKey, f.SalesTerritoryKey,
c.TimeKey, f.SalesOrderNumber, f.SalesOrderLineNumber
from
--find the first order date in the year for every key combination
(select i.ProductKey, i.DueDateKey, i.ShipDateKey,
i.CustomerKey, i.PromotionKey, i.CurrencyKey, i.SalesTerritoryKey
, i.SalesOrderNumber, i.SalesOrderLineNumber
, min(i.OrderDateKey) as mindate, t.CalendarYear
from
dbo.FactInternetSales i inner join dbo.DimTime t
on i.orderdatekey = t.timekey
--where SalesOrderNumber='SO43701'
group by t.CalendarYear, i.ProductKey, i.DueDateKey, i.ShipDateKey,
i.CustomerKey, i.PromotionKey, i.CurrencyKey, i.SalesTerritoryKey, i.SalesOrderNumber, i.SalesOrderLineNumber) f
--then generate one row for every date from that first order date to the end of the calendar year
inner join
DimTime c on f.CalendarYear = c.CalendarYear
and f.mindate<=c.TimeKey

You’ll have to excuse my SQL (it’s pretty ropey), but even though I’m sure it could be optimised it still runs very quickly; if you uncomment the where clause in the middle, so the table is filtered down to a single SalesOrderNumber, you’ll see what it’s doing more easily. Now, you’ll probably be worried, quite rightly, that doing this creates a fact table that’s much larger than your original fact table, and that’s certainly true. In this case FactInternetSales has just over 60,000 rows and the query above returns almost 7.5 million; however, depending on the granularity of your fact table and the nature of your data the explosion could be a lot smaller, and since we’re not interested in any of the measure columns from the original fact table the new table won’t be as wide either.

The next step is to take the query above and use it as the basis of a named query in the Adventure Works DSV, and then build a new measure group in the Adventure Works cube from it containing a single Count measure. Partitioning and aggregations might be useful here too, but I didn’t bother with them and just went ahead and processed; I was pleasantly surprised at how quickly it processed, I guess because there are no real measure columns to read.

Anyway, now we’ve got a measure in our cube we can use for our Non_Empty_Behavior property. I called it [Fact NEB Count] and here’s the new version of the calculation in the original query:

with member measures.ytdsales as
sum(
periodstodate([Date].[Calendar].[Calendar Year],[Date].[Calendar].currentmember)
, [Measures].[Internet Sales Amount])
, non_empty_behavior=[Measures].[Fact NEB Count]
select {[Date].[Calendar].[Month].&[2003]&[12].children} on 0,
non empty
[Customer].[Customer Geography].[Customer].members
*
[Promotion].[Promotion].[Promotion].members
on 1
from [adventure works]
where(measures.ytdsales, [Product].[Subcategory].&[2])

The query now runs in, wait for it, 7 seconds on a cold cache in SQL Server Management Studio, and if you look in Profiler you’ll see that in fact the majority of that time is taken up by Management Studio rendering the resultset: it takes just over 2.5 seconds to actually run on the server. A lot of effort, certainly, but a big improvement. Of course I’ve cheated a little bit and used a query that shows a particularly large change; the larger the percentage of rows in the crossjoin that actually contain data, the smaller the difference will be. That said, I’m working with a customer at the moment for whom this is a very valid scenario: they have a lot of customers and a lot of products, but each customer only tends to buy one or two products, and the users love to run big queries with YTD measures.

Written by Chris Webb

February 27, 2008 at 6:30 pm

Posted in Analysis Services

Parameterising Calculated Measure Definitions

with 9 comments

I’m currently holed up in my hotel room somewhere in Sweden facing up to the fact that I’ve barely started work on my presentation for SQLBits on Saturday. It’s on the subject of using Analysis Services as a data source for Reporting Services, and as it happens one of the tips I’ll be talking about came up at work today – handling parameterised calculated measure definitions – so I thought I’d blog about it too.

RS reports commonly use calculated measures in the WITH clause to get around the annoying problem that RS needs to know what its column names are in advance. If you’re creating a parameterised report which allows your users to choose which columns appear when it renders, you might be tempted to write your query something like this:

with member measures.test as
//this would actually be strtomember(@MyParameter) in the query
strtomember("[Measures].[Internet Sales Amount]")
select measures.test on 0,
topcount(
[Customer].[Customer].[Customer].members
*
[Date].[Calendar].[Calendar Quarter].members
, 100, measures.test) on 1
from [adventure works]

But performance isn’t great here: the query executes in 23 seconds cold cache on my laptop. When you compare it with the un-parameterised version:

with member measures.test as
[Measures].[Internet Sales Amount]
select measures.test on 0,
topcount(
[Customer].[Customer].[Customer].members
*
[Date].[Calendar].[Calendar Quarter].members
, 100, measures.test) on 1
from [adventure works]

…which executes in a mere 2 seconds cold cache, then you’ll be right in thinking we need to find a better approach. The culprit is of course the strtomember function; using any of the StrToX functions in a calculation is a surefire way of killing query performance. Unfortunately, because MDX parameters return strings (maybe we could get typed parameters in some future release?), we have to use strtomember to convert the unique name string our parameter returns into a member. The best way around this I’ve found is to create a named set in the WITH clause and parameterise that instead: it only gets evaluated once, and after that you can reference the set in your calculation, like so:

with
set mymeasure as {strtomember("[Measures].[Internet Sales Amount]")}
member measures.test as
mymeasure.item(0).item(0)
select measures.test on 0,
topcount(
[Customer].[Customer].[Customer].members
*
[Date].[Calendar].[Calendar Quarter].members
, 100, measures.test) on 1
from [adventure works]

This runs in a much more respectable 7 seconds on my laptop. But there’s one more trick you can use: our old friend non_empty_behavior. If we set it to the measure whose value we’re actually displaying, like this:

with
set mymeasure as {strtomember("[Measures].[Internet Sales Amount]")}
member measures.test as
mymeasure.item(0).item(0), non_empty_behavior= strtomember("[Measures].[Internet Sales Amount]")
select measures.test on 0,
topcount(
[Customer].[Customer].[Customer].members
*
[Date].[Calendar].[Calendar Quarter].members
, 100, measures.test) on 1
from [adventure works]

We’re back down to 3 seconds on a cold cache.
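Putting it all together, the dataset query in the report itself would end up looking something like the following (I’m using @MyParameter as the report parameter name, as in the comment in the first query above; substitute whatever you’ve actually called yours):

with
set mymeasure as {strtomember(@MyParameter)}
member measures.test as
mymeasure.item(0).item(0), non_empty_behavior= strtomember(@MyParameter)
select measures.test on 0,
topcount(
[Customer].[Customer].[Customer].members
*
[Date].[Calendar].[Calendar Quarter].members
, 100, measures.test) on 1
from [adventure works]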

Written by Chris Webb

February 26, 2008 at 9:06 pm

Posted in MDX

Calculated members vs assignments to real members

with 5 comments

One design technique I see used quite frequently on my travels, usually in financial applications and usually by people who have experience with OLAP tools other than AS, is the creation of members on a dimension which have no real data associated with them in the fact table and whose value is later overwritten with an assignment in the MDX Script. Imagine the following scenario, for example: you have a chart of accounts dimension which has a member Total Sales, which represents the total sales of all of the operating units in your company; you also want another member, Sales Plus Tax, which is the Total Sales value multiplied by a constant representing the rate of tax. To implement this you could either create a calculated member on the Account dimension which did the calculation, or in your Account dimension table create an extra real member and then in the MDX Script use a scoped assignment to write the value of the calculation to that member.
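To make the two options concrete, here’s roughly what each would look like in the MDX Script; the Account member names and the 1.175 tax rate are made up purely for the purposes of the example:

//Option 1: a calculated member that does the calculation
//(depending on your hierarchy you may need to include a parent member in the name)
CREATE MEMBER CURRENTCUBE.[Account].[Accounts].[Sales Plus Tax] AS
[Account].[Accounts].[Total Sales] * 1.175;

//Option 2: [Sales Plus Tax] exists as a real member in the Account dimension table,
//and a scoped assignment overwrites its value
SCOPE([Account].[Accounts].[Sales Plus Tax]);
THIS = [Account].[Accounts].[Total Sales] * 1.175;
END SCOPE;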

Creating the calculated member is the simplest option, so why not do that? Well, the ‘real member’ option has several advantages, such as:

  • You have absolute control over where the member appears in the hierarchy, unlike with a calculated member
  • Real members can have children, calculated members can’t
  • Real members can have member properties associated with them, calculated members can’t
  • It’s easier to manage all of your dimension members in one place, ie in the dimension table itself
  • It’s easier to write scoped assignments which cover only real members; writing scoped assignments that cover multiple calculated members can be a pain (see my blog entry here for example).
  • Some client tools (ok, Excel 2007) have problems with displaying and selecting calculated members on non-measures dimensions. For another practical example of the ‘real member’ technique, see Marco Russo’s blog entry on using it to create a time utility dimension here to get around exactly this issue.

Hmm, so that’s great, but are there any disadvantages to using it? I’m going to have to be vague here, unfortunately, because I don’t understand what’s going on inside AS well enough to say for sure, but in many cases where I’ve seen the ‘real member’ technique used it’s been associated with poor query performance. Even where the calculation involved has been trivial and there has been no further aggregation of the resulting value, I’ve seen situations where queries are slow and Profiler has gone nuts. It certainly doesn’t happen for every calculation and every query, but when it has happened and I’ve created a calculated member containing the same calculation, performance has been much better. I wonder what’s going on here? Maybe someone from Microsoft can comment? As you can probably guess, I’m writing this while onsite with a customer experiencing exactly this problem, and I’m hoping to provoke some discussion on the topic…

In the meantime, all I can say is: be wary of using the ‘real member’ technique, and test to see whether a calculated member performs better. And hopefully in Katmai+1 we’ll see some of the limitations of calculated members addressed so that they provide a stronger alternative.

Written by Chris Webb

February 23, 2008 at 11:04 pm

Posted in MDX

CTP6 and CU6 are now available

with one comment

The February CTP of SQL2008, aka CTP6, is now available for download here:
http://www.microsoft.com/downloads/details.aspx?familyid=749bd760-f404-4d45-9ac0-d7f1b3ed1053&displaylang=en&tm

Also now available, although not something you can download freely, is cumulative update 6 of SQL 2005 SP2:
http://support.microsoft.com/kb/946608/en-us

Several of the bugs listed there are ones I found (chest swells up with pride)… Meanwhile, in the SQL Server community, the calls for Microsoft to reverse their decision not to release an SP3 are getting louder. As I’ve said before, Analysis Services SP2 contained way too many bugs in my opinion, and given that many shops either don’t know about CU releases, can’t get hold of them or won’t install anything other than an official service pack, the fact that these bugs have since been fixed is irrelevant until the fixes get rolled up into an official service pack. It just sends a message to customers that Microsoft is more concerned with shipping a profitable new release than with making the product that everyone’s actually using at the moment (and that most people will be using for a long time to come) more reliable.

Written by Chris Webb

February 20, 2008 at 4:35 pm

Posted in Analysis Services

Cell Security and the Formula Engine Cache

with 4 comments

I’ve been looking at a security implementation for a customer this week and investigating what impact it’s having on their cube’s performance. A number of interesting points have come up which are all worthy of a blog entry, but I thought I’d start with what I noticed about cell security. I’d heard from Mosha that cell security is just about the worst feature to use on a cube from a performance point of view because it stops the AS engine from knowing whether a cell is empty or not, and I can see this on my customer’s cube: cold cache queries take a lot longer to run for users who have cell security than for users who don’t. But I also found another bad side-effect: it seems to stop the AS engine from caching the results of calculations.

Here are the steps I took to repro this in Adventure Works (I used the Simple version, but I’m sure it works just the same on the full version):

  • Comment out the entire MDX Script apart from the Calculate statement
  • Add the following calculated measure to the Script:
    CREATE MEMBER CURRENTCUBE.MEASURES.[Test] AS
    [Measures].[Internet Sales Amount]+1;
  • Add a new role to the database and give it ‘Read’ permission on the Adventure Works cube
  • Create a new Windows user (without administrator permissions on the cube) and add it to the new role
  • Start a Profiler trace, making sure you include the ‘Get Data From Cache’ event.
  • Connect to the cube using SQL Management Studio as this user (I used the ‘Run As’ option on the right-click menu) and run the following query:
    SELECT {[Measures].[Internet Sales Amount], [Measures].[Test]} ON 0,
    [Date].[Calendar Year].MEMBERS ON 1
    FROM [ADVENTURE WORKS]
  • In Profiler you’ll notice everything’s as you’d expect when you’re running a query on a cold cache: the disk is being hit, data is being read from partitions.
  • Run the query again, ie on a warm cache
  • In Profiler you’ll see that this time the data used to answer the query is read from cache. There are four ‘Get Data From Cache’ events: two reading data from the measure group cache, which is raw data from the cube; and one each from the flat cache and the calculation cache, which are formula engine caches containing the results of calculations. For more information on these types of cache, see chapter 28 of "Microsoft SQL Server Analysis Services 2005".
  • Now, go back to your role and on the Cell Data tab check the Enable Read Permissions checkbox and enter the following expression:
    IIF(1=1, TRUE, FALSE)
    Incidentally, there’s a bug to watch out for here: sometimes you need to click the Edit MDX button and then OK on the resulting dialog to make BIDS aware that the cell security expression has actually been edited. Also although this is a pretty trivial expression, I found that I could not repro the behaviour if I just used the expression:
    TRUE
    for cell security. Clearly AS is able to work out that this expression always returns true, even if it can’t do the same for the first one!
  • Next, clear the cache and, watching Profiler, rerun the query twice.
  • Notice that on the second run, you now only see three ‘Get Data From Cache’ events and they are all for the measure group cache. So AS has been able to cache the raw data but not the results of the calculation.

If you imagine a scenario where there are hundreds of users, many connecting through roles with cell security, very complex MDX calculations and queries that take 5-15 seconds to run on a cold cache then you can imagine the kind of impact that cell security can have on performance. Queries that should run instantaneously on a warm cache are consistently taking almost as long to run as they did on a cold cache because the calculations have to be re-evaluated every time. So the moral of this tale is: don’t use cell security unless you absolutely have to.

Written by Chris Webb

February 19, 2008 at 4:46 pm

Posted in MDX

Radius90 from 90 Degree Software

leave a comment »

Continuing the theme of free stuff I’ve received, the nice people at 90 Degree Software have been chatting with me on the phone, inviting me to webcasts and sending me eval versions of their Reporting Services report generation tool Radius90 for, oohhh, over a year and a half now. So why haven’t I blogged about the tool yet (ok, I did mention it once but only briefly)? Because it didn’t support Analysis Services as a data source. But now with the release of Radius90 Version 2 it does at last, so it’s review time.

Radius90 is one of those tools that has appeared to meet the need for an end-user friendly means of creating Reporting Services reports. BIDS is all very well for developers, but you wouldn’t give it even to a power user; Report Builder, meanwhile, is a bit simplistic in terms of the reports it can generate, has a confusing UI, and its support for Analysis Services is very poor (see this white paper for a complete list of the limitations of Report Builder on an AS data source). Radius90 gives you a nice Office 2007-style interface that combines the power of BIDS with the kind of ease of use that Report Builder was aiming at. I like it a lot, and to be honest it’s hard to see how you could come up with a better tool for the job. It’s so good, in fact, that anyone who’s seen what Microsoft have got planned for Report Builder in RS2008 (see this entry on Brian Welcker’s blog or this webcast if you haven’t) will notice some very strong similarities with Radius90. I guess this could cause problems for 90 Degree in the long term, but they’ve got enough time before Katmai arrives and people start migrating to put some more distinguishing features in place; as it is, they already have some nice collaboration functionality whereby you can reuse pieces of your own and other people’s reports via a peer-to-peer network, plus a good extensibility story.

What about their Analysis Services support? I found a video which demos this and also acts as a good basic introduction to the report design process.

As you’ll see if you watch the video, MDX queries are generated with a wizard. It’s easy to use, but I’d have preferred a drag-and-drop interface where you can see the results your query returns as you build it; they’ve taken a much more relational database-type approach rather than an OLAPy one, although I guess it comes down to what your users are comfortable with and it’s not a major issue anyway. Once the wizard has completed you don’t seem to be able to change the query except by editing the MDX (or at least I haven’t worked out how to, I could be wrong), which is a bit of a pain, although it’s pretty quick to delete a query and create a new one, and binding a query and the fields in it to a table is very easy to do. It doesn’t enforce the whole ‘only put measures on columns’ nonsense when you supply your own MDX, and I got all excited when I used a query which had a non-measures dimension on columns and it not only worked but gave me comprehensible column names; as soon as I tried to crossjoin more than one dimension on columns, though, it broke down… hohum. Of course the users that this tool is aimed at aren’t going to be writing their own MDX anyway, so again I’m not too bothered about this. I also noticed that query slicing has been implemented using subcubes rather than the WHERE clause, so any calculated members that rely on looking at the currentmember on a hierarchy you’re slicing by won’t work properly; it’s a fairly common mistake and one that Mosha mentions here, but I’ve reported it to them and I’m sure it’ll get fixed pretty soon. It’s the only real showstopper I found, though, and in general the MDX it produces is clean and efficient. UPDATE: it turns out I was using a pre-release version and this problem has now been fixed; sorry…

Overall, then, Radius90 is definitely recommended. Its AS support is still a bit immature but even in its present state it’s still adequate for most tasks; perhaps going forward they can make sophisticated support for AS data sources one of their selling points over Report Builder?

Written by Chris Webb

February 13, 2008 at 11:23 am

Posted in Reporting Services

BI Survey 7 Findings

with 7 comments

One of the benefits of being a blogger is all the free stuff that gets sent your way by people looking for some publicity. Unfortunately in my case the stuff I get isn’t all that sexy, ie no free XBoxes, but it’s still interesting – the books are always welcome, the free licences for AS client tools are useful too, and yesterday I got a freebie copy of the results of Nigel Pendse’s "BI Survey 7" with an invitation to blog about the contents. Now I’m sure over the next few weeks you’ll see the marketing machines of all the BI vendors crank into action, cherry-picking the findings so they can say things like "XSoft is the most reliable BI platform" and "YWare offers the best query performance" etc etc. But what does it actually have to say about Analysis Services?

Firstly, some characteristics of Analysis Services deployments. This year 34% of respondents were still using AS2K, with 66% now on AS2005 (with no-one using OLAP Services any more). This is almost a complete inversion of the usage figures from the previous survey, indicating most people migrated from AS2K to AS2005 in the last year. This is certainly what I’ve seen in my consultancy work, but I’m still surprised so many people have migrated given that the two platforms are so different. Another interesting point made is that while you’d expect AS to be the dominant OLAP tool when SQL Server is the data source, it’s also the most common OLAP tool for Oracle, Sybase and DB2 sites and it comes in second (behind Microstrategy) for Teradata. Oracle I can understand, because I work with Oracle data sources as much as I work with SQL Server (though in the past I used to work more often with Oracle, in fact) but I’ve never yet worked with anyone using DB2 or Teradata and only one person using Sybase, which isn’t in fact officially supported as a data source for AS. Maybe I don’t get around enough. I bet this is uncomfortable reading for Oracle, especially, since the only Oracle BI product that features in the top ten BI tools used against an Oracle data source is Essbase and that’s at #10.

One of the things that people often bring up about AS is that it only gets used because it’s bundled free with SQL Server. That’s certainly a big part of why it’s chosen, and the survey shows that when it is used it’s very common for there to be no formal evaluation process, or for it to be the only product evaluated. But that certainly doesn’t explain all the Oracle, IBM and Sybase shops that use it, and the survey also shows that when a formal, multi-product eval is conducted AS wins 75% of the time, putting it ahead of all of its major competitors. Again, that’s consistent with my experience: I was working with a customer recently who’d done an eval comparing AS with Oracle’s OLAP option and Essbase, and neither of the latter could handle the data volumes and dimension sizes that AS2005 could.

Regarding the problems faced by BI projects, it’s not surprising that slow query performance is the most common across the board for all products. AS is slightly above average in that 20% of respondents complained about this (maybe they need to get a specialised consultant on board to do some tuning? Now, who could do a job like that?); TM1 does the best at only 6.6%, SAP BI does worst at 37.5%. One thing that AS does worse on than average is ‘Security Weaknesses’, which to be honest is a bit strange given that I’ve never found any holes in the product that would lead to values being shown to people who shouldn’t see them. What I suspect is happening here is that although AS has a great set of options for security it’s still way too difficult to configure, especially for complex scenarios (see here for my experiences, although I wonder whether other products could meet these requirements at all) and for large numbers of users. There’s a real need for a better, more end-user friendly interface for managing roles, and perhaps it would be good to be able to use dynamic security right out of the box rather than have to implement it manually.

Lastly, there’s a good section on the client tools used with AS. I cannot believe that as many as 6.6% of people are still using Data Analyzer as a client tool – what planet are they living on? The high percentage using pivot tables I can understand, and there’s a very long tail of people using tools that have only got 0-4% of market share. The survey makes the interesting point that fewer AS users use any kind of Excel front-end than users of TM1, Essbase or SAP BI; maybe that’ll change as more companies move to Office 2007.

Anyway, hopefully I won’t get into trouble for divulging too many details, but I can honestly say it’s a fascinating read; at 420 pages long there’s a lot that I haven’t mentioned here, and it’s more substantial than a lot of IT books. If you’re a vendor of any kind I’m sure you’ll be buying this anyway, but I’d also recommend it if you’re a consultancy (lots of juicy facts to quote to prospective customers to trash the competition) or you’re about to embark on a BI project in-house. Overall, AS comes out of it very well, which is obviously good news for me and, I guess, the majority of people who read this blog. Of course Nigel has long been very positive about the MS BI stack, much more so than other analysts, although Gartner has suddenly got very positive too. I don’t know about anyone else, but sometimes I get a bit miffed by the way some BI journalists talk about Cognos and Business Objects as serious platforms but never mention Microsoft at all… hopefully that’s starting to change.

Written by Chris Webb

February 8, 2008 at 4:20 pm

Posted in On the internet
