Data Explorer–Where Does The Real Work Get Done?
Yes, I know, I’ve been blogging a lot over the last week or so – I can’t help it, there’s been so much interesting stuff to blog about! Anyway, here’s yet another post about Data Explorer; not about functionality this time, but the result of some investigations about how it works and some ideas that resulted from that.
After having played around with Data Explorer so much, I started to wonder just how scalable it is. In the course of doing this I found out that while Data Explorer can (and must) do a lot of the data processing logic itself, it pushes as much of that logic back to the data source whenever it can. For example, if you connect to the Adventure Works DW database in SQL Server and then implement some simple steps in a Data Explorer query that:
- Import everything from the DimDate table, hiding everything except the DayNumberOfWeek column
- Filter it so that DayNumberOfWeek is greater than 5:
= Table.SelectRows(DimDate, each [DayNumberOfWeek] > 5)
- Return the count of the number of rows that are returned:
…you can see if you run a Profiler trace that all of this logic is translated to SQL and not evaluated in Data Explorer itself. Here’s the final SQL statement generated by the third step above:
select count(1) as [$Item]
from [dbo].[DimDate] as [$Table]
) as [_]
where [_].[DayNumberOfWeek] > 5
) as [$Table]
I’m not sure which other data sources this is possible for, but it’s a good thing to see – you’re more likely to get good performance by doing the work back in the database than on your desktop. I wonder if it does this for Hive queries? If so, that would make it very useful for working with Hadoop/HDInsight.
And what about data sources where this isn’t possible though, like csv? I decided to test it out by loading a very large (1.4 GB) csv file I have with two columns containing integer values and 86 million rows, filtering it so that one of the column values was not null and less than 500000, and finding the resulting count of rows. Unfortunately I had to kill the query refresh after 15 minutes… In contrast, loading the same file into the Excel Data Model took almost exactly 6 minutes, and a DAX measure to do the same count/filter operation returned instantly. Not really a surprise I suppose, but it would be nice to have the option when working with csv files to load data into the Excel Data Model before you did anything else and then, when you did your transforms in Data Explorer, have those translated back to DAX queries and calculations. Alas, Data Explorer doesn’t support the Excel Data Model, PowerPivot or SSAS as data sources at the moment – I hope this changes soon.
This leads me on to the observation that in a lot of self-service BI scenarios the distinction between ETL and reporting can be blurred. If you just want to see a simple table showing your sales broken down by product in an Excel spreadsheet, you could build this quite easily with Data Explorer and not go near PowerPivot or the Excel Data Model. It would probably be easier to build too, since Data Explorer allows you to break complex operations into multiple steps, whereas putting all your logic in a single DAX expression can be mind-bendingly difficult (I’m reminded of this post I wrote two years ago). On the other hand you can do a lot of ETL-type things inside PowerPivot with calculated columns and measures; people have also been asking for the ability to use tables in the Excel Data Model as a data source for other tables (see here for example), for ETL purposes – it can be much faster and more elegant to use DAX to do some types of ETL work than SQL or SSIS. In the future I’d love to see tighter integration between Data Explorer and the Excel Data Model/PowerPivot so that we can get the best of both worlds.