Archive for the ‘Random Thoughts’ Category
Indulge me for a moment, please. Let’s imagine that somewhere in Microsoft, someone is planning for SQL Server v.next and is considering investing in new features for SSAS Multidimensional (don’t laugh – I wouldn’t be writing this post if I didn’t think it was a possibility). What features should they be?
Before I answer that question, it’s worth pointing out that despite what you might think there has been some investment in SSAS Multidimensional over the last few years. This post lists what was new in SSAS 2012 Multidimensional; since then support for DAX queries has been added and, umm, the new Divide() function. This must have been a lot of work for someone – but why does it get overlooked? One reason: none of these changes have made much difference to the ordinary SSAS Multidimensional developer’s life. DAX query support is great if you’re one of the few people that uses the SharePoint version of Power View; shockingly, it still doesn’t work in Excel 2013 Power View yet (though I guess it will be the way the new Power BI connects to on-prem Multidimensional). NUMA support is great if you work for an investment bank and have vast amounts of data and a high-spec server, but that’s only about 0.1% of the installed base.
So from this we can learn that the main consideration when choosing new features to implement should be that they should be relevant to the majority of SSAS Multidimensional developers, otherwise they’ll be ignored and MS may as well have not bothered doing anything. To that we can add these other considerations:
- These features should provide compelling reasons to upgrade from earlier versions of SSAS to the new version
- While some features should be available in all editions, there should also be some features that encourage customers to upgrade from Standard Edition to Enterprise Edition
- There are limited resources (time and developers) available and Power Pivot/SSAS Tabular will be the priority, so only a few features can be delivered.
- Features that are only there to support Power BI don’t count
With all of that borne in mind, here’s what I would choose to implement based on what I see as a consultant and from the popularity of particular topics on my blog.
Last-Ever Non Empty
One of the most popular posts I’ve ever written – by a gigantic margin – is this one on the last-ever non-empty problem. Given that so many people seem to come up against this, and that the MDX solution is complex and still doesn’t perform brilliantly, I think it should be built into the engine as a new semi-additive aggregation type. Since semi-additive measures are Enterprise Edition only, this would be my sole Enterprise Edition feature.
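To make the semantics concrete, here is a minimal Python sketch of what a last-ever non-empty lookup has to do. It is an illustration only, not how SSAS or the MDX solution works internally: the `Fact` row shape, the function names and the sample data are all hypothetical, and the total across customers is just one plausible reading of how such a measure might roll up.

```python
from datetime import date
from typing import List, Optional, Tuple

# Hypothetical fact rows: (customer, date, value). None stands for an empty cell.
Fact = Tuple[str, date, Optional[float]]


def last_ever_non_empty(facts: List[Fact], customer: str, as_of: date) -> Optional[float]:
    """Most recent non-empty value for one customer on or before as_of."""
    dated = [(d, v) for c, d, v in facts
             if c == customer and d <= as_of and v is not None]
    if not dated:
        return None  # the customer has never had a value up to this date
    return max(dated, key=lambda pair: pair[0])[1]


def total_last_ever_non_empty(facts: List[Fact], as_of: date) -> float:
    """Sum each customer's last-ever non-empty value (assumed rollup behaviour)."""
    customers = {c for c, _, _ in facts}
    values = (last_ever_non_empty(facts, c, as_of) for c in customers)
    return sum(v for v in values if v is not None)


facts = [
    ("A", date(2014, 1, 10), 100.0),
    ("A", date(2014, 3, 5), None),   # empty row: skipped when looking back
    ("B", date(2013, 11, 2), 40.0),
    ("B", date(2014, 2, 20), 55.0),
]
print(total_last_ever_non_empty(facts, date(2014, 3, 31)))
```

The key point is that the lookup scans back over all of time rather than just the current period, which is exactly what makes it expensive to express as a calculation.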
MDX Calculation Parallelism
Ever since I’ve been working with SSAS, people have always asked why the Formula Engine has been single-threaded. I understand why the SSAS dev team have ignored this question and instead concentrated on tuning specific scenarios: doing parallelism properly would be extremely difficult given the way MDX calculations can be layered over each other, and in plenty of cases it could lead to worse performance, not better. However I’m not asking for a ‘proper’ implementation of parallelism. I just want something dumb: a boolean property that you can set on a calculation that tells the Formula Engine to do this calculation on a separate thread. If it makes performance better then great; if not, then don’t set it. My guess is that even a crude implementation like this could make a gigantic difference to performance on many calculation-heavy cubes.
Drillthrough
Drillthrough is one of those features that almost everyone wants to use, but for some reason has been left in a semi-broken state ever since 2005. Here’s what needs to change:
- It should work with calculated members. I don’t expect SSAS to understand magically how to work out which rows to display for any given MDX calculation, but I would like a way of specifying in MDX what those rows should be.
- Those stupid, ugly column names – SSDT should let us specify readable column names and let us have complete control over the order they appear in.
- Excel should allow drillthrough on multiselect filters.
‘Between’ Relationships
This might seem a bit of a strange choice, and I suspect it may not be easy to implement, but another problem that I come across a lot in my consultancy is the ‘events-in-progress’ problem. I’ve blogged about solving it in MDX and DAX, as have many others. I would love to see a new ‘between’ dimension/measure group relationship type to solve this. In fact, competing OLAP vendor icCube has already implemented this and you can see how it works on that platform here and here. My feeling is that this would open up a massive number of modelling opportunities, almost as many as many-to-many relationships.
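For anyone who has not met the events-in-progress problem, the following Python sketch shows the logic a ‘between’ relationship would have to express: a fact row belongs to every date from its start date to its end date. The `Event` type, the field names and the sample orders are hypothetical, and this says nothing about how any engine would implement it.

```python
from datetime import date
from typing import List, NamedTuple


class Event(NamedTuple):
    name: str    # hypothetical label, e.g. an order number
    start: date  # first day the event is in progress
    end: date    # last day the event is in progress (inclusive)


def events_in_progress(events: List[Event], on: date) -> List[Event]:
    """Events whose start/end range contains the given date."""
    return [e for e in events if e.start <= on <= e.end]


events = [
    Event("order-1", date(2014, 1, 1), date(2014, 1, 10)),
    Event("order-2", date(2014, 1, 5), date(2014, 1, 20)),
    Event("order-3", date(2014, 2, 1), date(2014, 2, 3)),
]
print(len(events_in_progress(events, date(2014, 1, 7))))
```

The difficulty in a cube is that a regular relationship joins a fact row to one date, whereas here each row has to be counted against a whole range of dates.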
And that’s it, four features that I think could make SSAS Multidimensional v.next a must-have upgrade. I’m not so naive as to believe that any or all of these will be implemented, or even that we’ll get any new features at all, but who knows? If you have any other suggestions, please leave a comment.
When I first saw the announcement about Office Sway last week, I thought – well, you can probably guess what I thought. Does it have any potential for BI? After all, the Sway team are clearly targeting business users (as well as hipster designers and schoolchildren): look at the Northwest Aquarium and Smith Fashion Expansion samples, and notice that they contain tables, charts and infographics. What’s more, data storytelling is currently a very hot concept and Sway is clearly all about telling stories. Wouldn’t it be cool if you could have interactive PivotTables, PivotCharts and Power View reports from your Power BI site embedded in a Sway? It would be a much more engaging way of presenting data than yet another PowerPoint deck.
I have no idea whether any integration between Sway and Power BI is actually planned (I have learned not to get my hopes up about this type of thing), but even if it isn’t maybe someone at Microsoft will read this post and think about the possibilities… And isn’t this kind of collaboration between different teams supposedly one of the advantages Microsoft has over its competitors in the BI space?
PS I want a pink octopus costume just like the one that girl in the video has
Just a few months away from the tenth anniversary of my first post here, I’ve reached the milestone that is my 1000th blog post. If you’ve been with me since back then, thanks for reading! I have no idea how I managed to write so much – it’s an average of around two posts per week, which I certainly haven’t managed recently – but I suspect that the answer lies in the fact that I posted a lot of rubbish here in the early years that I’m embarrassed by now.
I can remember the day when I decided to start this blog quite well. It was just after Christmas so the office was quiet and I didn’t have much work to do; blogging was the cool new thing back in late 2004 and having discovered that Mosha had started a blog I thought it was something I should be doing too, so as not to be left behind. Microsoft had just launched its own blogging platform so I signed myself up. I didn’t think I would stick at it this long…
At first I thought I would just use it for writing up solutions to common Analysis Services and MDX problems, so that I didn’t have to keep repeating myself when I was answering questions on the microsoft.public.sqlserver.olap newsgroup. I kept going, though, for a lot of other reasons:
- To remember what I’ve learned. If I didn’t write this stuff down I would forget it, and trust me, I’m always googling for old posts here. This also explains why there is very little overall structure or purpose to what I write about. Technical books need to cover a topic very methodically: start at the basics, explain all the concepts and functionality, not miss anything out, and so on. Here, if I learn something interesting and useful while at work, or helping someone on a forum, or while playing around with a new tool, I just need to write that one thing down and not worry about whether it fits into some greater plan.
- I also find that the act of writing up a problem or topic for a post helps me understand it better. To be able to explain a technical concept you first have to be sure you understand it properly yourself, and writing for other people forces you to do that.
- To pass on Microsoft BI-related news. I work with these tools every day and so it’s natural that I want to find out what new toys I’ll have to play with in the future. I find this stuff interesting and fun, and it seems like there are several thousand other people around the world who also want to know what’s going on (even if we might not want to admit this publicly). I like airing my opinions too: sometimes Microsoft does things I agree with, sometimes it does things I think are crazy, and since my career and business are wholly dependent on Microsoft BI I think the occasional bit of public feedback is healthy and allowable. Brent Ozar sums up my feelings on this subject perfectly here. I’ve got in trouble once or twice for things I’ve written, but I’ve never regretted writing any of my posts.
- It’s marketing for my consultancy and training. I have to make a living somehow, and if I didn’t blog then it would be much harder to find customers – I think my blog is much more valuable in this respect than writing books or speaking at conferences or user groups. I don’t want to sound cynical, though, and I don’t see this blog as something that is purely commercial. I love to share and it just so happens that sharing my knowledge is also good for business. Some two years after starting this blog, just after I resigned from my permie job to become a self-employed consultant, one of my soon-to-be ex-colleagues said to me “You know, you’ll have to stop blogging now: why would anyone hire you if they can read everything you know on your blog for free?”. I didn’t have a good answer for him at the time but I soon found that if someone finds the answer to a problem on my blog, they are much more likely to think about hiring me when they have a problem they can’t solve. What’s more, I firmly believe that the way that people in the SQL Server community share knowledge publicly, even when they are aware that this knowledge could be used by their competitors, means that the community as a whole is stronger, SQL Server is more successful, and we all benefit more commercially than if we had not shared in the first place.
- I enjoy writing so I’m quite happy to spend my spare time writing blog posts. There’s no way I could have forced myself to write a thousand posts if I didn’t enjoy doing it. I also travel a lot for work, so that results in a lot of time spent in airports and hotel rooms with nothing better to do. To make another comparison with writing tech books: a tech book has to be objective, impartial, polished, structured, sober and impersonal, whereas a blog is (or at least in my opinion should be) personal, subjective, haphazard, rough-edged and sometimes controversial. This makes blogging less of an effort and more of a pleasure.
- Finally, I admit it, I get a kick out of knowing that when I write something there are people out there who want to read it.
Will I make it to my 2000th post? I have no idea, but I probably will if Microsoft are still making BI tools and I’m still using them.
Some time ago I received a review copy of a book called “Ethics Of Big Data” from O’Reilly; I didn’t get round to writing a review of it here for a number of reasons but, despite its flaws (for example its brevity and limited scope), it’s worth reading. It deals with the ethics of data collection and data analysis from a purely corporate point of view: if organisations do not think carefully about what they are doing then
“Damage to your brand and customer relationships, privacy violations, running afoul of emerging legislation, and the possibility of unintentionally damaging reputations are all potential risks”
All of which is true, although I think what irked me about the book when I read it was that it did not tackle the wider and (to my mind) more important question of the social impact of new data technologies and their application. After all, this is what you and I do for a living – and I know that I haven’t spent nearly enough time thinking these issues through.
What prompted me to think about this again was a post by Adam Curtis which argues that the way that governments and corporations are using data is stifling us on a number of levels from the personal to the political:
“What Amazon and many other companies began to do in the late 1990s was build up a giant world of the past on their computer servers. A historical universe that is constantly mined to find new ways of giving back to you today what you liked yesterday – with variations.
Interestingly, one of the first people to criticise these kind of “recommender systems” for their unintended effect on society was Patti Maes who had invented RINGO. She said that the inevitable effect is to narrow and simplify your experience – leading people to get stuck in a static, ever-narrowing version of themselves.
Stuck in the endless you-loop.”
Once our tastes and opinions have been reduced to those of the cluster the k-means algorithm has placed us in we have become homogenised and easier to sell to, a slave to our past behaviour. Worse, the things we have in common with the people in other clusters become harder to see. Maybe all of this is inevitable, but if there is going to be an informed debate on this then shouldn’t we, as the people who actually implement these systems, take part in it?
Today marks eight years since my first ever post on this blog, and every year on this date I write a review of what’s happened to me professionally and what’s gone on in the world of Microsoft BI in the previous year.
For me, 2012 has been yet another busy year. The SSAS Tabular book that Marco, Alberto and I wrote – “SQL Server Analysis Services 2012: The BISM Tabular Model” – was published in July and has been selling well, and the balance of my consultancy and training work has started to move away from Multidimensional and MDX towards Tabular, PowerPivot and DAX. It’s always exciting to learn something new and, frankly, the lack of any significant new functionality in Multidimensional and MDX has meant they have got a bit boring for me; at the same time, though, moving out of my comfort zone has been disconcerting. It seems like I’m not the only Microsoft BI professional feeling like this though: the most popular post on my blog by a long chalk was this one on Corporate and Self-Service BI, and judging by the comments it resonated with a lot of people out there.
Whether or not Microsoft is neglecting corporate BI (and I’m not convinced it is), it’s definitely making a serious investment in self-service BI. The biggest Microsoft BI release of this year was for me not SQL Server 2012 but Office 2013. That’s not to say that SQL Server 2012 wasn’t a big release for BI, but that Office 2013 was massive because of the amount of functionality that was packed into it and because the functionality was so well executed. You can read this post if you want details on why I think it’s significant, but I’ve really enjoyed playing with Excel 2013, PowerPivot, Power View and Office 365; there’s more cool stuff in the form of Mobile BI, GeoFlow and Data Explorer coming next year, all of which are very much part of the Office 2013 story too. No Microsoft BI professional can afford to ignore all this.
The other big theme in Microsoft BI this year, and indeed BI as a whole, was Big Data. I reckon that 90% of everything I read about Big Data at the moment is utter b*llocks and as a term it’s at the peak of its hype cycle; Stephen Few has it right when he says it’s essentially a marketing campaign. However, as with any over-hyped technological development there’s something important buried underneath all the white papers, and that’s the increasing use of tools like Hadoop for analysing the very large data sets that traditional BI/database tools can’t handle, and the convergence of the role of business analyst and BI professional in the form of the data scientist. I’m still not convinced that Hadoop and the other tools that currently get lumped in under the Big Data banner will take over the world though: recently, I’ve seen a few posts like this one that suggest that most companies don’t have the expertise necessary for using them. Indeed, Google, the pioneer of MapReduce, felt the need to invent Dremel/BigQuery (which is explicitly referred to as an OLAP tool here and elsewhere) to provide the easy, fast analysis of massive datasets that MapReduce/Hadoop cannot give you. My feeling is that the real future of Big Data lies with tools like Dremel/BigQuery and Apache Drill rather than Hadoop; certainly, when I played with BigQuery it clicked with me in a way that Hadoop/HDInsight didn’t. I hope someone at Microsoft has something similar planned… or maybe this is the market that PDW and Polybase are meant to address? In which case, I wonder if we’ll see a cloud-based PDW at some point?
I read an interesting article by Stephen Swoyer on the TDWI site today, about a new Gartner report that suggests that companies should start selling the data they collect for BI purposes to third parties via public data marketplaces. This is a subject I’ve seen discussed a few times over the last year or so – indeed, I remember at the PASS Summit last year I overheard a member of the Windows Azure Marketplace dev team make a similar suggestion – and I couldn’t resist the opportunity to weigh in with my own thoughts on the matter.
The main problem that I had with the article is that it didn’t explore any of the reasons why companies would not want to sell the data they’re collecting in a public data marketplace. Obviously there are a lot of hurdles to overcome before you could sell any data: you’d need to make sure you weren’t selling your data to your competitors, for example; you’d need to make sure you weren’t breaking any data privacy laws with regard to your customers; and of course it would have to be financially worth your while to spend time building and maintaining the systems to extract the data and upload it to the marketplace – you’d need to be sure someone would actually want to buy the data you’re collecting at a reasonable price. Doing all of this would take a lot of time and effort. The main hurdle though, I think, would be disinterest: why would a company whose primary business is something else start up a side-line selling its internal data? It has better things to be spending its time doing, like focusing on its core business. If you sell cars or operate toll roads why are you going to branch out into selling data, especially when the revenue you’ll get from doing this is going to be relatively trivial in comparison?
What’s more, I think it’s a typical piece of tech utopianism to think that data will sell itself if you just dump it on a public data marketplace. Maybe apps on the Apple App Store can be sold in this way, but just about everything else in the world, whether it’s sold on the internet or face-to-face, needs to be actively marketed and this is something that the data generators themselves are not going to want to make the effort to do. As I said earlier, those companies that are interested in selling their data will still need to be careful about who they sell to, and the number of potential buyers for their particular data is in any case going to be limited. Someone needs to think about what the data can be used for, target potential customers and then show these potential customers how the data can be used to improve their bottom line.
For example, imagine if all the hotels around the Washington State Convention Centre were to aggregate and sell information on their bookings for the next six months into the future to all the nearby retailers and restaurants, so it was possible for them to predict when the centre of Seattle would be full of wealthy IT geeks in town for a Microsoft conference and therefore plan staffing and purchasing decisions appropriately. In these cases a middle man would be required to seek out the potential buyer and broker the deal. The guy that owns the restaurant by the convention centre isn’t going to know about this data unless someone tells him it’s available and convinces him it will be useful. And just handing over the data isn’t really good enough either – it needs to be used effectively to prove its value, and the only companies who’ll be able to use this data effectively will be the ones who’ll be able to integrate it with their existing BI systems, even if that BI system is the Excel spreadsheet that the small restaurant uses to plan its purchases over the next few weeks. Which of course may well require outside consultancy… and when you’ve got to this point, you’re basically doing all of the same things that most existing companies in the market research/corporate data provider space do today, albeit on a much smaller scale.
I don’t want to seem too negative about the idea of companies selling their data, though. I know, as a BI consultant, that there is an immense amount of interesting data now being collected that has real value to companies other than the ones that have collected it. Rather than companies selling their own data, however, what I think we will see instead is an expansion in the number of intermediary companies who sell data (most of which will be very small), and much greater diversity in the types of data that they sell. Maybe this is an interesting opportunity for BI consultancies to diversify into – after all, we’re the ones who know which companies have good quality data, and who are already building the BI systems to move it around. Do public data marketplaces still have a role to play? I think they do, but they will end up being a single storefront for these small, new data providers to sell data in the same way that eBay and Amazon Marketplace act as a single storefront for much smaller companies to sell second-hand books and Dr Who memorabilia. It’s going to be a few years before this ecosystem of boutique data providers establishes itself though, and I suspect that the current crop of public data marketplaces will have died off before this happens.
After yesterday’s stream of consciousness on how PowerPivot could be used in SSRS, here’s a follow-up post on how PowerPivot and ‘traditional’ SSAS could be integrated. Hold on, you say, surely that’s a no-brainer? Surely all that would need to happen would be that Vertipaq would become a new storage mode inside SSAS, along with MOLAP, ROLAP and HOLAP, and everyone would be happy? Well, maybe. But here’s an alternative idea that I bounced off some friends a while back and got good feedback on, which I thought I’d air here.
Before I go on, let me state my position on some things:
- I like PowerPivot, and the more I use PowerPivot the more I like it.
- I really like the power of the Vertipaq engine, and I want to be able to use it in a corporate BI environment.
- I really like DAX, and I want to be able to use it in a corporate BI environment.
- BUT SSAS as we have it today is a very mature, rich tool that I don’t want to lose. PowerPivot models will always be a little rough-and-ready; a good SSAS cube is a lot more ‘finished’ and user-friendly (I always liken building a cube to building a UI). SSAS dimension security is, for example, an absolute killer feature in many corporate BI solutions; PowerPivot won’t have anything like this until at least the next version, whenever that will be.
- I also love MDX and I don’t want to lose it. MDX Script assignments, calculated members on non-measures dimensions, all of the things that PowerPivot and DAX can’t do (and probably won’t ever do) are things that I use regularly and in my opinion are essential for many complex, enterprise BI implementations.
- I don’t want the dev team to abandon corporate SSAS, and neither do I want the dev team to waste time re-implementing things in PowerPivot that we already have in corporate SSAS. Already people are asking when they can have security and partitioning in PowerPivot. I want new stuff though!
So, like all users I want absolutely everything possible. How could it be done? Here’s my basic idea: let us be able to build regular SSAS cubes using PowerPivot models as data sources, with SSAS working in something similar to ROLAP mode so every request for data from the cube is translated into an MDX (or SQL – remember SSAS, and presumably PowerPivot, supports a basic version of SQL) query against the PowerPivot model.
In more detail, let’s imagine we have an instance of SSAS running in Vertipaq mode and an instance of SSAS running in normal mode. You’d be able to do the following:
- Fire up BIDS and create a new SSAS project.
- Create a data source, which was a PowerPivot database on your Vertipaq instance of SSAS.
- Create a new Data Source View, which showed all of the tables in your PowerPivot database already joined. Alternatively, here I can imagine connecting to other data sources like SQL Server, creating a Data Source View as normal and then taking the DSV and deploying it as a PowerPivot model onto the Vertipaq instance of SSAS. So in effect, the DSV designer becomes a development environment for PowerPivot models.
- Create a regular SSAS cube in the usual way, only using the PowerPivot tables in the DSV.
- Set the storage mode of your dimensions and partitions to the new ROLAP-like storage mode; each SSAS partition could then be based on a separate PowerPivot table. This would mean that when you queried the cube, the SSAS instance issued MDX or SQL queries against the Vertipaq instance of SSAS, just as it issues SQL queries in ROLAP mode today. I suppose though there would be an overhead to making an out-of-process call, so maybe it would be better if you only had one instance of SSAS that could host both Vertipaq and regular SSAS databases at the same time, so all these requests could stay in-process.
The first, obvious, point here is that with this approach we get the traditional, rich SSAS cubes that we know and love and the raw speed of Vertipaq. So one objective is achieved. But I think there would be a lot of other benefits:
- You’d get two cubes for the price of one: the PowerPivot cube and the SSAS cube. You could choose which one to query depending on your needs.
- The ability to turn DSVs into PowerPivot models also gives you a proper development environment for creating PowerPivot models, integrated with BIDS and Visual Studio (so you also get source control). The current Excel-based UI is all very well, but us developer types want a nice visual way of creating relationships between tables.
- You’re able to use all of the new data sources that PowerPivot can work with in traditional SSAS. Imagine being able to create a planning and budgeting solution where users wrote values into an Excel Services spreadsheet, which then fed into PowerPivot via the new Excel Services REST API, which then in turn fed into a SSAS planning and budgeting cube complete with custom rollups and all the complex financial calculations you can only do in MDX.
- If your users have already built an existing PowerPivot model that they like and want to turn into an ‘official’ BI solution, you can very easily take that model as the starting point for building your cube by importing it into a DSV.
- It would also make it relatively easy to upgrade existing SSAS projects to use PowerPivot storage – you’d just convert your existing DSV into a PowerPivot model.
- SSAS drillthrough would be much, much faster because you’d be drilling through to the PowerPivot model and not the underlying relational source.
- You’d also have the possibility of working in something like HOLAP mode. Vertipaq may be fast, but with really large data volumes some pre-calculated aggregations are always going to be useful.
- You could define calculated measures in DAX in the PowerPivot model, and then expose them as measures in the SSAS cube. Probably you’d need some special way of handling them so they didn’t get aggregated like regular measures, but in some cases you’d want to take a calculated measure and sum it up like a regular measure (kind of like SQL calculations defined in named calculations today); many more calculations, like year-to-dates, can be treated as semi-additive measures. Effectively this means you are performing some multidimensional calculations outside the Formula Engine, in the SSAS Storage Engine (which in this case is PowerPivot), in the same way I believe that measure expressions work at the moment.
- For such additive and semi-additive calculations, it also opens up the possibility of parallelism since these calculations can be done in parallel in each partition and the result summed at the end. It also means you get the option to use either DAX or MDX, and can choose the right language for the job.
- There’s no duplication of dev work needed. For users of PowerPivot who want features like security, partitioning or parent/child relationships, you tell them they have to upgrade to regular SSAS; PowerPivot becomes something like SSAS Express. For users of SSAS who want the speed of Vertipaq, you tell them they have to use a PowerPivot database as their data source. The two complement each other nicely, rather like twins… now where have I heard that analogy before?
- You also have a convincing story for BI professionals who are sceptical/hostile to PowerPivot to win them over: traditional, corporate SSAS does not go away but is able to build on the new features of PowerPivot.
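The parallelism point in the list above (additive calculations evaluated per partition, with the results summed at the end) can be sketched in a few lines of Python. This only shows the shape of the idea: the partition data and function names are hypothetical, and Python threads are a stand-in for whatever mechanism a real storage engine would use.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def partition_total(rows: List[float]) -> float:
    """Evaluate the additive calculation for a single partition."""
    return sum(rows)


def parallel_sum(partitions: List[List[float]]) -> float:
    """Evaluate every partition on its own worker, then add up the partial results."""
    with ThreadPoolExecutor() as pool:
        return sum(pool.map(partition_total, partitions))


# Three hypothetical partitions, e.g. one per year of data.
partitions = [[1, 2], [3], [4, 5, 6]]
print(parallel_sum(partitions))
```

This works because addition does not care how the rows are grouped. A calculation that depends on the results of other calculations cannot be split up this simply, which is why only the additive and semi-additive cases are candidates.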
So there we have it, another fantasy on the future of the MS BI stack sketched out. You may be wondering why I’ve taken the time to write these two posts – after all, I don’t work for Microsoft and I’m sure plenty of people on the dev team have their own ideas on what features they want to implement for Denali. Well, as the saying goes, if you don’t ask you don’t get! And with Kilimanjaro almost out of the door now’s the time to ask. If you agree with what I’ve said here, or you disagree, or you have a better idea, please leave a comment…