Archive for the ‘Random Thoughts’ Category
Adapting SQLIS to work with tuples and sets as a data source
NonEmpty
(Crossjoin(
Exists(
[Customer].[Customer].[Customer].Members
, {[Customer].[Country].&[United Kingdom], [Customer].[Country].&[France]})
, [Product].[Product].[Product].Members)
, [Measures].[Internet Sales Amount])
with
member measures.outputdemo as TupleToStr(([Customer].[Customer].Currentmember, [Product].[Product].Currentmember)
)
select {measures.outputdemo} on 0,
NonEmpty(
Crossjoin(
Exists(
[Customer].[Customer].[Customer].Members
, {[Customer].[Country].&[United Kingdom], [Customer].[Country].&[France]})
, [Product].[Product].[Product].Members)
, [Measures].[Internet Sales Amount])
on 1
from
[Adventure Works]
Microsoft, BI and Search
He said that Microsoft’s goal — but not its initial offering — would go beyond finding URLs and instead focus in on the specific information sought by Internet users.
"Generally these days what you get back is URLs, and based upon research 50 percent of the time you do a search you don’t get the URL you’re looking for," he said.
Holloway said that the promise of Microsoft’s search capability is to dig down.
For example, he said, potential home-buyers might find a group of houses in the price range and with the precise amenities they are seeking.
Or a surfer might find a restaurant with the kind of menu a diner wants in a particular geographic area.
Reward Beta Testers better!
Usage-Based Partitioning
The partitions in the Project REAL database seem to violate one of the basic best practices of SQL Server 2000. There is no data slice set for the partitions. In SQL Server 2000, partitions must have the data slice set so that the run-time engine knows which partition to access. This is similar to specifying a hint to a relational query optimizer. In SQL Server 2005, this is no longer necessary. Processing the partition now automatically builds a histogram-like structure in the MOLAP storage. This structure identifies which members from all dimensions are included in the partition. Thus, so long as the storage method is MOLAP, the data slice is an optional (and unused) property. However, the data slice is used with ROLAP storage or when proactive caching involves a ROLAP access phase. In both of these circumstances, the actual fact data is never moved so the system does not have a chance to identify a member. In this case, setting the data slice for the partition remains a necessary and critical step if you expect the system to perform well.
Because the MOLAP structures dynamically determine the data slice, a new type of partitioning technique is possible with SQL Server 2005. The best way to describe this technique is via a simple example.
Suppose a system that you are designing has a product dimension of 1,000 products. Of these, the top 5 products account for 80% of the sales (roughly evenly distributed). The remaining 995 products account for the other 20% of the sales. An analysis of the end-user query patterns show that analysis based on product is a common and effective partitioning scheme. For example, most of the reports include a breakdown by product. Based on this analysis, you create six partitions. You create one partition each for the top 5 products and then one “catchall” partition for the remainder. It is easy to create a catchall partition. In the query binding, add a WHERE clause to the SQL statement as in the following code.
In the top five partitions (1 through 5) use the following code.
SELECT * FROM <fact table>
WHERE SK_Product_ID = <SK_TopNthProduct#>
In the catchall partition use the following code.
SELECT * FROM <fact table>
WHERE SK_Product_ID NOT IN (<SK_TopProduct#>,
<SK_2ndTopProduct#>
<SK_3rdTopProduct#>
<SK_4thTopProduct#>
<SK_5thTopProduct#>)
This technique requires a lot of administrative overhead in SQL Server 2000 Analysis Services. In SQL Server 2000, the data slice must identify each and every member in the partition—even if there are thousands and thousands of members. To implement the example, you would need to create the catchall partition data slice with 995 members in it. This is in addition to the administrative challenge of updating that list as new members are added to the dimension. In SQL Server 2005 Analysis Services, the automatic building of the data slice in the partition eliminates the administrative overhead.
- First, get a log of all the queries that users are actually running. Unfortunately the Query Log in AS2005, like AS2000, doesn’t actually record the actual MDX of all the queries run. The only way to do this is to use Profiler; I was a bit worried about whether doing this would have an adverse impact on query performance but when I put the question to MS they indicated it shouldn’t be much (Richard Tkachuk also mentioned, as an aside, that turning off Flight Recorder should result in an increase of a few % in terms of query performance – a tip to remember for production boxes, I think). Once you’ve run your trace you can then export all of the MDX statements from it to a text file very easily.
- You’d then need a bit of code to extract the unique names of all the members mentioned explicitly in these queries. It should be a fairly simple task if you get the right regular expression, I think. Note that this ignores queries which use any kind of set expression – my thinking was that individually named members are going to be the most interesting because they’re going to be the ones which slice the queries the most: if users are querying on all the countries in the cube that’s not going to be any use for partitioning, but if they have a particular product in the WHERE clause that is much more useful to know about.
- Then you could do some data mining to cluster these members by their propensity to appear in a query together. The idea is that each of the resulting clusters would translate into a partition; those members which didn’t fall nicely into a cluster and those members that didn’t get extracted in step #2 would have their data fall into one of Dave’s ‘catch-all’ partitions. Imagine this scenario: the UK branch of the Adventure Works corporation suddenly finds there is massive demand for bikes after petrol (‘gas’, for you Americans) prices rise massively. As a result, analysts in the UK run lots of queries which are sliced by the Product Category [Bikes] and the Country [UK]. You’d hope that this pattern would emerge in the clustering and result in a single partition containing all the data for ([Bikes], [UK]), so in the future similar queries run much faster.
What does everyone think? There seems to be a lot of activity these days in the comments section of my blog, so I thought I’d invite feedback. Can anyone see a fatal flaw in this approach?
LinkShare: my idea for a $50000-prize winning app
As I said, I don’t have the time to enter the Connected Systems Developer competition that I blogged about the other week, but that hasn’t stopped me thinking about what I might build if I did enter. The following idea came to me at around 2am this morning when I was desperately trying to get my 22-month-old daughter to go back to sleep, and having nothing better to do this afternoon I thought I’d bounce it off anyone reading my blog. So comments are invited – even if they are just to say that it’s a rubbish idea and/or someone’s thought of it before and/or it’ll never work. I have after all categorised this post under ‘Random Thoughts’!
Business Case:
In the modern office everyone does a lot of web surfing; some of it might even be business-related. And whenever we see something interesting we typically copy the link into a mail, add a few words of explanation and send it on to a few people who might also want to have a look. I send at least two or three such emails a day. For the typically lazy web surfer, though, this process is a bit of a hassle so we only bother to do it when we think the link is really interesting and (because we don’t want to get a reputation as the office spammer) we only send it to a small number of people we know who we think are going to find it interesting too. It’s my contention that it would be cool if we could share more of these links with more people.
So, we need to solve three problems in our quest to share the interesting links we find during our daily surfing:
1) It needs to be easier to share the link once we’ve found it
2) We don’t want to send stuff on to people who aren’t going to be interested, and we want to receive only the links that we’re going to be interested in.
3) We’d like to be able to share links with people who we don’t actually know well enough to contact directly.
Of course there are plenty of existing ways that people share links, such as newsgroups, email discussion lists and blogs but they typically only address the third of the above problems fully, the second only partially and the first not very well at all. For instance, anyone reading my blog is presumably doing so because they’re interested in Microsoft’s BI tools and they’re going to be interested in any links to webcasts, articles etc that I post up, but if they’re like me they subscribe to upwards of a hundred rss feeds – and that’s only on subjects they’re really interested in – so we still have the proverbial information overload. The same goes for email discussion lists and newsgroups. And in all these cases in order to share information you have to open an email, write a blog post etc, which all require effort.
Let me give you an idea of the kind of scenario I want to tackle. This morning I was reading this story on the Register, and followed a link on a whim to this page, a set of pictures of Cybermen with funny captions. It brought a smile to my face but I didn’t send it on to anyone else because a) it didn’t seem worth the bother, and b) I didn’t know whether any of the people I usually send stuff onto were at least mildly into Dr Who in the way I am. I’m not going to blog about it because it’s not relevant to BI, I don’t subscribe to any Dr Who blogs, dls or newsgroups because I’m not that much of a Dr Who fan, and so no-one else is going to see it. Which is a shame.
Functional Spec:
Anyway, enough waffle about the theory. The solution I’m thinking of would consist of something like the following:
- An IE toolbar with only two controls on it: a button saying "This is a cool page" which you hit whenever you find an interesting link (regardless of whether you’ve found it yourself or it has been recommended to you by the system), and a textbox which allows you to add a short commentary on the contents of the page if you want. Whenever you hit the button, it sends the current url in your browser plus any comments to a web service which…
- …Puts the information in a queue on a server. There’s an app which gradually works its way through every link submitted, retrieves the page, strips the text from the html and does some funky text mining on this and the comments you’ve submitted, and classifies it. This is then used by…
- …Another server app which looks at your tastes (based on pages you’ve submitted in the past and perhaps other users who you’ve said you’re interested in seeing links from) and then using some more data mining gives you a short list of recently submitted links that you might be interested in, along with the comments of the people who have recommended them. This could either come in the form of a web page, a customised rss feed or a regular email newsletter.
So, in practice, let’s imagine it working as follows. Chris, Jon and Colin all work in a large corporation, in the same team doing the same kind of BI stuff. During his morning surfing, Chris submits 5-10 links; one, on a new feature of MDX, gets recommended automatically to Jon and Colin because everyone in the team works with MDX and has submitted MDX-related pages in the past. One, containing pictures of Cybermen with amusing comments, gets recommended only to Colin and only appears about halfway down his list because he’s a bit of a sci-fi fan and has submitted a few sci-fi links in the past. Meanwhile, David, who works in a different team and doesn’t know Chris, Jon or Colin finds a cool article on C-Omega and submits it so it gets recommended to the rest of his team; they all in turn click their buttons and so it eventually appears at the top of Jon’s list (because he’s really into coding) and somewhere down the list for Chris (because he’s not so into coding, but this is a really cool article nonetheless).
The larger the number of users with similar taste, the better it should work – more links submitted plus more people voting on the same links, and so the mining models can get to know people’s tastes much more quickly. I could imagine it doing well as an intranet app at a large tech company. It would probably need to give more priority to newer links (people want the latest stuff, and you don’t want old but popular links clogging up your recommendations) and maybe have some way of removing links you’ve already seen from your list of recommendations. One other extra feature that occurred to me was that the app could also generate a report showing the users who submitted the most interesting links, so as to generate a bit of rivalry and encourage future usage.
The key to it all though is the fact that all you need to do to submit a page is click a button in IE – the absolute minimum effort possible – and the fact that the job of the mining model is clear – recommend a page which will make you click your button in turn.
Technology:
It should be fairly straightforward to build the toolbar and the web service. Qualification for the competition comes with the use of SQL 2005 for storing all the data, SQLIS to do the processing, AS to do the data mining, and RS to do the web-based reports, daily email, even the rss feed (maybe as a custom rendering extension?). I’ll admit that I don’t know enough about data mining to know whether that bit will really work, but hey, it might.
OK, enough fantasising. If anyone does implement this and enters the competition, please can I have a share of the winnings?
Grouping in MDX – response to Mosha
I’m honoured by my mention in Mosha’s blog! But I think my entry on Grouping in MDX, and Mosha’s comments on it, need some further clarification and discussion.
First of all, CREATE SESSION CUBE. I agree it is probably the best way to implement grouping at the moment and that it works well in Excel, but it’s not ideal:
- Surely it’s a client-only solution? You’d want the option of creating these groupings on the server for all users, just as you would any other calculated member, and CREATE SESSION CUBE doesn’t offer you that, does it?
- I don’t know what the overhead of creating a session cube on the client is (or, in Yukon, on the server) – but is it as small as creating a calculated member? What is the impact on caching, especially given the fact that you have to drop and create the session cube when you change or add a new group?
- A lot of the time, for example in Reporting Services, you are only working with one-off queries rather than sessions and you really want to be able to create a group with query rather than session scope.
- The MDX needed to create the session cube and then manage it within the session seems overly complex for such a simple requirement.
Secondly, to address Mosha’s point on why you would need the VisualTotals and Aggregate functions in the same query: it’s because you’d want to use your new group member in any scenario where you could use a normal member, and that includes a query which used VisualTotals. Imagine you had a measure which showed the distinct count of customers across all your stores, and you wanted a report which had a) a single group member containing your top 5 stores, b) several other individual stores and c) showed the visualtotal of all the distinct customers in both the group and the individual stores displayed. I think that would be a reasonable requirement and one which wouldn’t be possible unless AS ‘knew’ what members went into the group.
Thirdly, sets in the WHERE clause (and also subcubes in the FROM clause, which do the same thing) inYukon. Unfortunately, this only works when you’re slicing by the group and not when the group is on a visible axis so it doesn’t fit the scenario I was describing.
Overall, then, CREATE SESSION CUBE is almost the functionality that I want but it doesn’t allow groups to be defined on the server. So we’re close…!
Grouping members together
One of the weaknesses of Analysis Services, in my opinion, is support for creating custom groupings of members. I reckon that 90% of all calculated members on non-measures dimensions must be doing just this, ie just doing an AGGREGATE or SUM over a set of members, and yes calculated members will return the right values but my complaint is something else. It’s that you then have no idea what members were aggregated together inside this calculated member, and that functions like VISUALTOTALS, NONEMPTYCROSSJOIN etc that you would like to be ‘group aware’ of course aren’t. Some examples needed, I think…
Consider the following query on Foodmart 2000:
WITH MEMBER [Customers].[All Customers].[USA].DEMO AS ‘AGGREGATE({[Customers].[All Customers].[USA].[CA], [Customers].[All Customers].[USA].[OR]})’
SELECT MEASURES.MEMBERS ON 0,
{[Customers].[All Customers].[USA].CHILDREN, [Customers].[All Customers].[USA].DEMO} ON 1
FROM SALES
Wouldn’t it be nice, then if VISUALTOTALS ’knew’ what was in the set and this query
WITH MEMBER [Customers].[All Customers].[USA].DEMO AS ‘AGGREGATE({[Customers].[All Customers].[USA].[CA], [Customers].[All Customers].[USA].[OR]})’
SELECT MEASURES.MEMBERS ON 0,
VISUALTOTALS({[Customers].[All Customers].[USA],{[Customers].[All Customers].[USA].[CA], [Customers].[All Customers].[USA].[OR]} }, "* DEMO")
ON 1
FROM SALES
returned the same results as this query?
WITH MEMBER [Customers].[All Customers].[USA].DEMO AS ‘AGGREGATE({[Customers].[All Customers].[USA].[CA], [Customers].[All Customers].[USA].[OR]})’
SELECT MEASURES.MEMBERS ON 0,
VISUALTOTALS({[Customers].[All Customers].[USA],[Customers].[All Customers].[USA].DEMO }, "* DEMO")
ON 1
FROM SALES
And that when you did a NONEMPTYCROSSJOIN against your calculated member, it would return the same results as when you did a NONEMPTYCROSSJOIN against the set that was aggregated in the calculated member? And perhaps also that you could drill down from the calculated member to see the members inside it?
Of course this isn’t possible at the moment, because a calculated member could contain any sort of calculation, so AS simply can’t make any assumptions. But if there was a special kind of group calculated member, which simply took a set of members as its definition and which always returned an AGGREGATE of that set, surely AS could make these assumptions?
Just a thought…
