Chris Webb's BI Blog

Analysis Services, MDX, PowerPivot, DAX and anything BI-related

Archive for March 2013

Creating And Using Functions in Data Explorer (Power Query)

with 14 comments

NOTE: This post was written before Data Explorer was renamed as Power Query. All of the content is still relevant to Power Query.

I’ve been reading through the Data Explorer Formula Language and Library specifications (which can be downloaded here) over the last few weeks, learning about what’s possible with it when you go beyond the UI. This will translate into a lot of Data Explorer blog posts in the coming weeks…! Here’s the first of those posts.

Today I’m going to look at how to create and use functions. Page 86 onwards in the Formula Language specification has a lot of detail about how to define a function in the M language that Data Explorer uses, but it doesn’t include any practical examples of how functions can be created and used in Data Explorer and Excel unfortunately. I’ve done some experimentation, though, and this post aims to fill that gap.

First of all, take a look at this Excel worksheet with two tables in it, called InputNumbers1 and InputNumbers2:

image

Let’s say that you want to create two Data Explorer queries that each use one of these tables as a source, and in both cases you want to add a third column that displays the product of the two existing columns. So, for example, for the first table you want a new column with the values 2, 12, 30 and 56 in.

Now this is a very simple calculation and certainly one that Data Explorer can handle easily, but let’s say that you want to perform this operation many times in many different queries and for obvious reasons you don’t want to have to duplicate the same logic in each query. You want to create a function!

To do this, you need to use the Write Query option from the From Other Sources button on the Data Explorer tab in the ribbon:

image

Then change the name of the new query to MultiplicationFunction by double-clicking on the query name in large letters at the top of the Edit Query dialog:

image

Then enter the following code as the only step:

= (x,y) => x * y

This defines a function that takes two parameters, x and y, and returns their product. Click Done to save the query. The query will show the following result in the worksheet:

image

You can ignore this new table; you can disable the data load onto the worksheet if you want, but it seems like the table always has to be there and you certainly can't delete it.
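
As a quick sanity check (this is just a hypothetical test step, not part of the walkthrough), you could create another blank query whose only step invokes the new function directly:

= MultiplicationFunction(3, 4)

If everything is wired up correctly this should return 12.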

To use this new function, create a new query from the table InputNumbers1 (shown above) using the From Table button. Then, in the Edit Query dialog, create a new step by selecting Insert Column/Custom:

image

And then use the new function to calculate each row in the new column, passing the two existing columns as parameters:

image

This results in the following step:

= Table.AddColumn(Source, "Custom", each MultiplicationFunction([FirstNumber],[SecondNumber]))

image

And so you have your new column showing the product of the [FirstNumber] and [SecondNumber] columns. Then do the same thing for the InputNumbers2 table and you'll now have two Data Explorer queries with custom columns in them:

image

image
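
For reference, here's a minimal sketch of what the complete M for the first of these queries might look like (the step names the UI generates may differ; I'm assuming the worksheet table is called InputNumbers1, as above):

let
    Source = Excel.CurrentWorkbook(){[Name="InputNumbers1"]}[Content],
    //call the shared function once per row to build the new column
    InsertedCustom = Table.AddColumn(Source, "Custom", each MultiplicationFunction([FirstNumber],[SecondNumber]))
in
    InsertedCustom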

Finally, to make sure that the function is indeed being used, go and alter the definition of the function to be

= (x,y) => x / y

Refresh both of the other queries, and you’ll see the contents of the custom column have changed for both of them:

image

image

I can imagine that the ability to break functionality out into functions will reduce the complexity, and improve the maintainability, of many ETL solutions that get built with Data Explorer in the future.

You can download my sample Excel workbook with this demo in here.

Written by Chris Webb

March 25, 2013 at 5:22 pm

UK/US Date Format Bug in PowerPivot and SSAS Tabular

with 4 comments

I don’t usually blog about bugs, but this one has been irritating me no end for the last year – so I thought it deserved some publicity…

In Excel 2010 PowerPivot and in SSAS 2012 Tabular models (but not the Excel 2013 Data Model, interestingly), if you have an English locale that is not US English (e.g. UK or Australian English), you may find that date columns appear to be formatted correctly as dd/mm/yyyy inside the PowerPivot window or in SSDT, but when you get to Excel you see the dates formatted in the US mm/dd/yyyy format. So, for example, on my laptop if I import the DimDate table from Adventure Works into Excel 2010 then I see dates formatted as dd/mm/yyyy, as I've specified in the Formatting section of the ribbon in the PowerPivot window:

image

image

However, in an Excel PivotTable, I see dates formatted as mm/dd/yyyy:

image

There is a workaround though, which I found on the PowerPivot forum (thank you Steve Johnson, if you’re reading) – you can get the dates to format correctly if you go to More Date Formats and choose dd/MM/yy or one of the other formats from the dropdown list that appears:

image

image

Here are the correctly formatted dates in a PivotTable:

image

It seems like there is already a Connect item open on this issue here, so please vote to get it fixed!

Written by Chris Webb

March 21, 2013 at 9:30 am

Finding Shakespeare’s Favourite Words With Data Explorer (Power Query)

with 23 comments

NOTE: This post was written before Data Explorer was renamed as Power Query. All of the content is still relevant to Power Query.

The more I play with Data Explorer, the more I think my initial assessment of it as a self-service ETL tool was wrong. As Jamie pointed out recently, it's really the M language with a GUI on top of it, and the GUI itself, while good, doesn't begin to expose the power of the underlying language: I'd urge you to take a look at the Formula Language Specification and Library Specification documents, which can be downloaded from here, to see for yourself. So while it can certainly be used for self-service ETL it can do much, much more than that…

In this post I’ll show you an example of what Data Explorer can do once you go beyond the UI. Starting off with a text file containing the complete works of William Shakespeare (which can be downloaded from here – it’s strange to think that it’s just a 5.3 MB text file) I’m going to find the top 100 most frequently used words and display them in a table in Excel.

Before I do that, though, some things to point out. First, there’s a new update of Data Explorer that appeared a few days ago – the Data Explorer team blog has the details. One of the new bits of functionality is a button that allows you to edit all of the expressions in your query at once:

image

Second, when you're building a query and you want to add a new step manually that refers to the previous step but doesn't apply any calculations or transforms, you need to click on the fx button next to the expression of an existing step:

image

I’ve used this quite extensively to write custom steps that aren’t possible with the UI.

Here is the full code for my query to find the top 100 words:

Source = Lines.FromBinary(File.Contents("C:\Users\Chris\Documents\CompleteShakespeare.txt")),
RemoveLineBreaks = Lines.ToText(Source, " "),
RemovePunctuation = Text.Remove(RemoveLineBreaks,{"," ,"." ,"?" ,";" ,":" ,";" ,"'"
                            ,"@" ,"#" ,"~" ,"{" ,"[" ,"}" ,"]" ,"(" ,")", "*"}),
Lowercase = Text.Lower(RemovePunctuation),
TurnIntoTable = Table.FromValue(Lowercase),
SplitIntoWords = Table.TransformColumns(TurnIntoTable, {"Value", Splitter.SplitTextByWhitespace()}),
ListOfWords = SplitIntoWords{0}[Value],
TableFromList = Table.FromList(ListOfWords, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
RenameColumnToWord = Table.RenameColumns(TableFromList,{{"Column1", "Word"}}),
RemoveBlanks = Table.SelectRows(RenameColumnToWord, each ([Word] <> "")),
FindWordCounts = Table.Group(RemoveBlanks, {"Word"}, {{"Count", each Table.RowCount(_), type number}}),
SortedRows = Table.Sort(FindWordCounts,{{"Count", Order.Descending}}),
KeptFirstRows = Table.FirstN(SortedRows,100)

Broken down step-by-step:

Source = Lines.FromBinary(File.Contents("C:\Users\Chris\Documents\CompleteShakespeare.txt"))

This loads the contents of CompleteShakespeare.txt into Data Explorer and breaks it up into lines of text:

image

RemoveLineBreaks = Lines.ToText(Source, " ")

This removes all the line breaks and creates a single chunk of text:

image

RemovePunctuation = Text.Remove(RemoveLineBreaks,{"," ,"." ,"?" ,";" ,":" ,";" ,"'" ,"@" ,"#" ,"~" ,"{" ,"[" ,"}" ,"]" ,"(" ,")", "*"})

Removes any punctuation, and

Lowercase = Text.Lower(RemovePunctuation)

Turns all the words to lowercase:

image

TurnIntoTable = Table.FromValue(Lowercase)

Takes the text and creates a table with one column and one row containing the text:

image

SplitIntoWords = Table.TransformColumns(TurnIntoTable, {"Value", Splitter.SplitTextByWhitespace()})

Takes the text and creates a List object with one entry for every word in the text.

ListOfWords = SplitIntoWords{0}[Value]

TableFromList = Table.FromList(ListOfWords, Splitter.SplitByNothing(), null, null, ExtraValues.Error)

RenameColumnToWord = Table.RenameColumns(TableFromList,{{"Column1", "Word"}})

Takes the list and turns it into a table with one column and one row for each word:

image

RemoveBlanks = Table.SelectRows(RenameColumnToWord, each ([Word] <> ""))

Removes any rows containing blanks.

FindWordCounts = Table.Group(RemoveBlanks, {"Word"}, {{"Count", each Table.RowCount(_), type number}})

Does a ‘group by’ to return a table with one row for each distinct word, and a column containing the number of occurrences of that word:

image

SortedRows = Table.Sort(FindWordCounts,{{"Count", Order.Descending}})

Sorts this table in descending order by word count.

KeptFirstRows = Table.FirstN(SortedRows,100)

Only returns the first 100 rows from this table.

And at this point, we have the top 100 words used by Shakespeare (admittedly including the text at the beginning and end of the file inserted by Project Gutenberg that I couldn’t be bothered to remove):

image

No surprises here really – "the" comes in at #1 with 18444 occurrences, "lord" comes in at #45, "king" at #69 and so on.

While you probably aren’t interested in Shakespeare for your own business purposes, it shows what would be possible if you wanted to analyse what was being said in Facebook updates, Tweets, product reviews and so on. It wasn’t all that hard to write either, especially considering that I’m fairly new to Data Explorer, so more sophisticated variations on this would certainly be possible without much extra effort. And it’s worth pointing out that it’s fast too: it takes a couple of seconds to refresh, and I’m not even sure I’ve done everything in the most efficient way possible.

You can download the sample workbook (but not the Complete Works of Shakespeare) here. If you want to point it at a different text file, just modify the first line of the code.
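
Incidentally, if you wanted to reuse this logic on several different text files you could wrap the whole query in a function that takes the file path as a parameter, along the lines of the functions post above. A rough sketch (only the first couple of steps shown):

(FilePath as text) =>
let
    //FilePath replaces the hard-coded path in the first step
    Source = Lines.FromBinary(File.Contents(FilePath)),
    RemoveLineBreaks = Lines.ToText(Source, " ")
    //...the remaining steps from the query above would follow unchanged
in
    RemoveLineBreaks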

UPDATE: it turns out there is either a bug in PQ or I’m hitting one of its limits, and the counts here aren’t all correct. See the comments from Frank below for a version that works.

Written by Chris Webb

March 15, 2013 at 7:15 pm

The Dangers of Non_Empty_Behavior

with 7 comments

One thing I see quite often when I’m reviewing MDX is the incorrect use of the Non_Empty_Behavior property. 99% of the time it’s set without the developer really knowing what it does, and luckily in most cases it makes no difference to performance or what the calculation returns. However… in this post I’m going to argue that you should not use it unless you really know what you’re doing and you can prove that it’s beneficial – because if it’s set incorrectly it can sometimes mean you get incorrect results from your queries.

First of all, what does Non_Empty_Behavior actually do? Basically, it's a hint to the SSAS Formula Engine that says that a calculation will return a null value if a given tuple also returns a null value. Moreover, it's only a hint, and one that the Formula Engine may or may not act on. For example, if you had an MDX calculated measure that was defined as

CREATE MEMBER CURRENTCUBE.[Measures].[MyMeasure] as [Measures].[Internet Sales Amount]*2;

It would be true to say that this calculated measure would always return a null value when the real measure [Measures].[Internet Sales Amount] returned null, so you would be able to say:

CREATE MEMBER CURRENTCUBE.[Measures].[MyMeasure] as [Measures].[Internet Sales Amount]*2
, NON_EMPTY_BEHAVIOR=[Measures].[Internet Sales Amount];

The late, great Mosha covered its use in detail in a number of blog posts from a few years ago:
http://sqlblog.com/blogs/mosha/archive/2006/11/05/budget-variance-a-study-of-mdx-optimizations-evaluation-modes-and-non-empty-behavior.aspx
http://sqlblog.com/blogs/mosha/archive/2007/03/06/averages-ratios-division-by-zero-and-non-empty-behavior.aspx

These posts all refer to SSAS 2005, and from SSAS 2008 onwards a lot of work went on to ensure that most of the scenarios where Non_Empty_Behavior had been necessary to get good performance were optimised away. Certainly, in the simple example I showed above there would be no need to set Non_Empty_Behavior at all – the Formula Engine will already be able to optimise it as much as it can. This means that if you’re on SSAS 2005 then Non_Empty_Behavior can be very useful for tuning calculations, but if you are using SSAS 2008 or greater it’s highly likely that you do not need to use Non_Empty_Behavior at all. In fact, Mosha’s advice back in 2009 when he did a preconference seminar at the PASS Summit on SSAS performance tuning was summed up in these two slides:

image

image

I think the message here is pretty clear…

So, it probably won’t help your query performance, but why is it dangerous? Consider the following query:

WITH
MEMBER MEASURES.SomeTest AS
IIF(Now()>0, "Internet", "Reseller")
MEMBER MEASURES.Demo as
STRTOMEMBER("[Measures].[" + measures.sometest + " Sales Amount]")
SELECT
{[Measures].Demo} 
ON 0,
NON EMPTY
[Product].[Category].[Category].members
*
[Date].[Date].[Date].members
ON 1
FROM [Adventure Works]

This returns 1887 rows. However, if I set Non_Empty_Behavior incorrectly, like so:

WITH
MEMBER MEASURES.SomeTest AS
IIF(Now()>0, "Internet", "Reseller")
MEMBER MEASURES.Demo as
STRTOMEMBER("[Measures].[" + measures.sometest + " Sales Amount]")
, NON_EMPTY_BEHAVIOR=[Measures].[Reseller Sales Amount]
SELECT
{[Measures].Demo}
ON 0,
NON EMPTY
[Product].[Category].[Category].members
*
[Date].[Date].[Date].members
ON 1
FROM [Adventure Works]

The query now returns only 64 rows! By saying that the calculation will return a null value when Reseller Sales Amount is null – which is not the case – rows that do in fact have values get filtered out by the NON EMPTY clause. Clearly this is a bad thing, and also one that may be very hard to spot when you are testing.

The example above was quite obviously wrong, and in fact it’s quite hard on SSAS 2012 to find a query where Non_Empty_Behavior returns incorrect results. More often I see calculations like this:

WITH
MEMBER MEASURES.PreviousSales AS
([Measures].[Internet Sales Amount], [Date].[Calendar].CURRENTMEMBER.PREVMEMBER)
, NON_EMPTY_BEHAVIOR=[Measures].[Internet Sales Amount]
SELECT {[Measures].[Internet Sales Amount], measures.PreviousSales} ON 0,
[Date].[Calendar].[Calendar Year].MEMBERS ON 1
FROM [Adventure Works]

…where Non_Empty_Behavior is set incorrectly (the calculation returns the Internet Sales Amount for the previous time period, and it’s not true to say that the previous time period’s sales will be null if the current time period’s sales are null) but the query still returns the correct results. This is because, luckily, Non_Empty_Behavior is being ignored.
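
The safe option here is simply to leave the property off altogether; the query is exactly the same minus one line and returns the same, correct results:

WITH
MEMBER MEASURES.PreviousSales AS
([Measures].[Internet Sales Amount], [Date].[Calendar].CURRENTMEMBER.PREVMEMBER)
SELECT {[Measures].[Internet Sales Amount], MEASURES.PreviousSales} ON 0,
[Date].[Calendar].[Calendar Year].MEMBERS ON 1
FROM [Adventure Works]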

So, to sum up, if you are using SSAS 2008 or greater Non_Empty_Behavior probably won't solve your performance problem and may give you incorrect results. That is not to say that there aren't still one or two isolated scenarios where it can be useful. Jorg Klein came across one fairly recently and blogged about it here:
http://sqlblog.com/blogs/jorg_klein/archive/2013/02/18/implement-ssas-md-cell-security-using-dimension-security-with-blazing-performance.aspx

But these scenarios are few and far between. Therefore don’t use it if you’re not sure what it does!

Written by Chris Webb

March 11, 2013 at 1:30 pm

SQLDay 2013 Conference in Poland

with one comment

This is just a quick note to say that I’ll be speaking, and running a preconference seminar, at the SQLDay 2013 conference in Poland on the 22nd-24th May. I’m looking forward to it because I’ve never been to Poland before!

You can find all the details of my precon, “Real World Cube Design and Performance Tuning with Analysis Services”, here:
http://conference.plssug.org.pl/lang/en-us/agenda/sql-day-2013/sroda-22-05wednesday-22-05/w1-bi-track-real-world-cube-design-and-performance-tuning-with-analysis-services/

If you can’t make it to Poland, I’d love to see you at the PASS BA Conference (btw the discount code BAC521BL now gives you a $200 discount if you haven’t booked already), DatabaseDays Switzerland, SQLBits or one of my courses in London. Blimey, that’s a lot of teaching… and there’s more to be announced soon too.

Written by Chris Webb

March 7, 2013 at 4:47 pm

Posted in Events

Data Explorer (Power Query) – Where Does The Real Work Get Done?

with 4 comments

NOTE: This post was written before Data Explorer was renamed as Power Query. All of the content is still relevant to Power Query.

Yes, I know, I’ve been blogging a lot over the last week or so – I can’t help it, there’s been so much interesting stuff to blog about! Anyway, here’s yet another post about Data Explorer; not about functionality this time, but the result of some investigations about how it works and some ideas that resulted from that.

After having played around with Data Explorer so much, I started to wonder just how scalable it is. In the course of doing this I found out that while Data Explorer can (and must) do a lot of the data processing logic itself, it pushes as much of that logic back to the data source whenever it can. For example, if you connect to the Adventure Works DW database in SQL Server and then implement some simple steps in a Data Explorer query that:

  1. Import everything from the DimDate table, hiding everything except the DayNumberOfWeek column
  2. Filter it so that DayNumberOfWeek is greater than 5:
    = Table.SelectRows(DimDate, each [DayNumberOfWeek] > 5)
  3. Return the count of the number of rows that are returned:
    = Table.RowCount(FilteredRows)

…you can see if you run a Profiler trace that all of this logic is translated to SQL and not evaluated in Data Explorer itself. Here’s the final SQL statement generated by the third step above:

select count(1) as [$Item]
from
(
select [_].[DayNumberOfWeek]
from
(
select [DayNumberOfWeek]
from [dbo].[DimDate] as [$Table]
) as [_]
where [_].[DayNumberOfWeek] > 5
) as [$Table]
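
For reference, the Data Explorer side of this is only a few lines of M. Here's a rough, hand-written sketch of a query equivalent to the three steps above (the server name is a placeholder, and the step names will differ from what the UI generates):

let
    Source = Sql.Database("localhost", "AdventureWorksDW"),
    DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
    //keep only the DayNumberOfWeek column
    RemovedOtherColumns = Table.SelectColumns(DimDate, {"DayNumberOfWeek"}),
    FilteredRows = Table.SelectRows(RemovedOtherColumns, each [DayNumberOfWeek] > 5),
    RowCount = Table.RowCount(FilteredRows)
in
    RowCount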

I’m not sure which other data sources this is possible for, but it’s a good thing to see – you’re more likely to get good performance by doing the work back in the database than on your desktop. I wonder if it does this for Hive queries? If so, that would make it very useful for working with Hadoop/HDInsight.

What about data sources where this isn't possible, though, like csv? I decided to test it out by loading a very large (1.4 GB) csv file I have with two columns containing integer values and 86 million rows, filtering it so that one of the column values was not null and less than 500000, and finding the resulting count of rows. Unfortunately I had to kill the query refresh after 15 minutes… In contrast, loading the same file into the Excel Data Model took almost exactly 6 minutes, and a DAX measure to do the same count/filter operation returned instantly. Not really a surprise I suppose, but it would be nice to have the option when working with csv files to load data into the Excel Data Model before you did anything else and then, when you did your transforms in Data Explorer, have those translated back to DAX queries and calculations. Alas, Data Explorer doesn't support the Excel Data Model, PowerPivot or SSAS as data sources at the moment – I hope this changes soon.
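
The DAX measure for that count/filter operation would be nothing fancy; something along these lines (a sketch, and the table and column names here are made up):

Filtered Row Count :=
CALCULATE (
    COUNTROWS ( BigCsv ),
    NOT ( ISBLANK ( BigCsv[Column1] ) ),
    BigCsv[Column1] < 500000
)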

This leads me on to the observation that in a lot of self-service BI scenarios the distinction between ETL and reporting can be blurred. If you just want to see a simple table showing your sales broken down by product in an Excel spreadsheet, you could build this quite easily with Data Explorer and not go near PowerPivot or the Excel Data Model. It would probably be easier to build too, since Data Explorer allows you to break complex operations into multiple steps, whereas putting all your logic in a single DAX expression can be mind-bendingly difficult (I’m reminded of this post I wrote two years ago). On the other hand you can do a lot of ETL-type things inside PowerPivot with calculated columns and measures; people have also been asking for the ability to use tables in the Excel Data Model as a data source for other tables (see here for example), for ETL purposes – it can be much faster and more elegant to use DAX to do some types of ETL work than SQL or SSIS. In the future I’d love to see tighter integration between Data Explorer and the Excel Data Model/PowerPivot so that we can get the best of both worlds.

Written by Chris Webb

March 6, 2013 at 4:53 pm

Calling A Web Service From Data Explorer (Power Query), Part 2

with 3 comments

NOTE: This post was written before Data Explorer was renamed as Power Query. All of the content is still relevant to Power Query.

Don’t you hate it when you get ready to blog about something cool, and then someone comes along and beats you to it? That’s what’s just happened to me – Matt Masson just wrote an excellent blog post on how to do address lookup against a web service here:
http://www.mattmasson.com/2013/03/dynamic-lookups-with-data-explorer/
That’s what I was going to write about in part 2 of this series. I have to say he did a better job than I would have done though, and shows off several cool tricks I hadn’t seen before…

Ah well, so it goes – and anyway there are a few things I was going to say that are still worth saying. If you remember, in Part 1 I showed how to call the Google Distance Matrix API from Data Explorer, but I hard-coded the start and end points in the expression I used. How can you get the start and end points directly from Excel? Well it turns out that Data Explorer can read values direct from Excel tables.

To start with, you need to create some Excel tables to hold the values you want to use. I created two Excel tables called Origin and Destination to hold the two values I needed:

image

I can then take the value from the first row in the sole column of each table (which are again called Origin and Destination, as you can see) by using

Excel.CurrentWorkbook(){[Name="Origin"]}[Content]{0}[Origin]

and

Excel.CurrentWorkbook(){[Name="Destination"]}[Content]{0}[Destination]

Here’s the full source expression with the hard-coded values replaced:

=
Xml.Document(
Web.Contents(
"http://maps.googleapis.com/maps/api/distancematrix/xml"
, [Query = [
origins = Excel.CurrentWorkbook(){[Name="Origin"]}[Content]{0}[Origin]
, destinations = Excel.CurrentWorkbook(){[Name="Destination"]}[Content]{0}[Destination]
, mode = "driving", sensor = "false" ] ]))

Now, you can enter any address in the Origin and Destination tables, click Refresh on the Query as shown in the screenshot below:

image

And the query will be rerun with the new parameters. Note that it might also be a good idea to URL encode any text that you enter in these tables, using the EncodeURL() Excel function; I haven’t done so here because I’m using UK postcodes, which don’t need URL encoding, as my origins and destinations to make things simple.

It’s also possible to use the results of one query inside another query. Data Explorer allows you to use an Excel table as the source for a query:

image

And selecting the Destination Excel table as a source gives the following expression:

= Excel.CurrentWorkbook(){[Name="Destination"]}[Content]

The easy way to use any value from an existing query as the starting point for a new query is to right-click on a cell and select Add As New Query:

image

Doing this creates a new query with the value you clicked on as its source:

image

Here’s the expression for the only step in the newly-created query:

= Destination{0}[Destination]

image

Even if this new query isn’t any use on its own, the expression shown here can be used in other queries. Here’s the original web service call adapted to use two other queries as its source:

= Xml.Document(
Web.Contents(
"http://maps.googleapis.com/maps/api/distancematrix/xml"
, [Query = [
origins = Origin{0}[Origin]
, destinations = Destination{0}[Destination]
, mode = "driving", sensor = "false" ] ]))

However, as Matt shows in his post, this only works if Fast Combine is enabled, otherwise you’ll see an error message.

You can download the sample workbook for this post here.

Written by Chris Webb

March 4, 2013 at 8:53 pm

Calling A Web Service From Data Explorer (Power Query), Part 1

with 9 comments

NOTE: This post was written before Data Explorer was renamed as Power Query. All of the content is still relevant to Power Query.

Last week I showed how you could use the WebService() function in Excel 2013 to bring location data into Excel 2013. Since this is a topic I have a particular interest in, this week I’ll show you how to do the same thing all over again but in Data Explorer.

First of all, a simple example. In that previous post I used the Google Distance Matrix API to calculate the distance between two points; for example the following call shows how long it would take me to drive from my home to Buckingham Palace to see the Queen (52 minutes in case you’re wondering):
http://maps.googleapis.com/maps/api/distancematrix/xml?origins=HP66HF&destinations=SW1A1AA&mode=driving&sensor=false

The following post on the Data Explorer forum from James Terwilliger gives some helpful tips on how to consume web services from within Data Explorer:
http://social.msdn.microsoft.com/Forums/en-US/dataexplorer/thread/069b50e3-ab9e-4ee4-99a9-23440fcfc768

…but it’s not altogether straightforward. For example if you paste the link above into the From Web data source, you do get something returned but it’s extremely hard to find any useful data. Instead, I found the following steps worked:

  • First, hit From Web and enter something in the URL box:
    image
  • This gives you a new web query, but you want to discard any auto-generated code in the first step. Instead, paste the following expression:
    = Xml.Document(
    Web.Contents("http://maps.googleapis.com/maps/api/distancematrix/xml"
    , [Query = [ origins = "HP66HF", destinations = "SW1A1AA", mode = "driving", sensor = "false" ] ]))

    image

    This uses Web.Contents() to call the web service (as described in that forum reply) with the appropriate parameters. Xml.Document() is then used to interpret the response as an XML document.

  • With this done, it’s quite easy to navigate through the XML by clicking on the Table links in each step to find the useful data:
    image
  • And finally hit Done to surface it in the worksheet:
    image
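
Incidentally, the [Query] record used here is just a tidier way of building up the same query string as in the URL shown earlier; as a sketch, you could equally pass the whole thing as a single URL:

= Xml.Document(
Web.Contents("http://maps.googleapis.com/maps/api/distancematrix/xml?origins=HP66HF&destinations=SW1A1AA&mode=driving&sensor=false"))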

Some thoughts at this point: I don’t like the way the DE formula language is case-sensitive, and I suspect in the long run it will have to be either hidden or replaced with VBA or Excel formula language/DAX if it’s going to be used even by Excel power users. It is very, very powerful though, and luckily the UI is good enough to mean that 99% of users will never need to write DE formula language anyway.

The next question: I’ve hard-coded my origins and destinations in this example, but how can I read these values from the worksheet without my users having to open Data Explorer and edit the query? Tune in for Part 2 to find out!

Written by Chris Webb

March 4, 2013 at 8:30 am

Importing Data From Multiple Log Files Using Data Explorer (Power Query)

with 17 comments

NOTE: This post was written before Data Explorer was renamed as Power Query. All of the content is still relevant to Power Query.

It’s only been two days since the official Preview release of Data Explorer and already the blog posts about it are coming thick and fast. Here are some of the more interesting ones that I’ve seen that show what’s possible with it:
http://sqlblog.com/blogs/jamie_thomson/archive/2013/02/28/traversing-the-facebook-graph-using-data-explorer.aspx
http://www.mattmasson.com/index.php/2013/03/access-the-windows-azure-marketplace-from-data-explorer/
http://community.altiusconsulting.com/best-oscar-winning-film-my-first-data-explorer-adventure/
http://www.spsdemo.com/blog/Lists/Posts/Post.aspx?List=c67861cd-a0d9-4ed8-9d9d-9b29652a516f&ID=371&Web=f74569c2-ae3f-42c6-a3fa-9f099dfaeb7f

Obviously I can’t let everyone else have all the fun, so I thought I’d show how you can use Data Explorer to import data from multiple files, clean it, load it into a single table and then report on it.

First of all, the data. Like all bloggers I have an unhealthy interest in my blog stats, and one of the ways I monitor the hits on this site is using Statcounter. I’m also a bit of a miser, though, so I only use their freebie service and that means that I only get to see stats on the last 500 site visits. How can I analyse this data then? Well, Statcounter allow you to download log data as a csv file, so at about 2:30pm I downloaded one file and at 8:30pm I downloaded another.

Now, the first cool thing to show about Data Explorer is that you can import and merge data from multiple files with the same structure if they're in the same folder. With both of my files in a folder called Blog Logs, and Excel open, the first thing you need to do is go to the Data Explorer tab and hit From File/From Folder:

image

The next step is to enter the name of the folder containing the files in the dialog:

image

With that done, a new Query screen appears with a list of the files in the folder:

image

You then need to hit the icon with the two down arrows and a horizontal line that I’ve highlighted in the screenshot above, next to the Content heading. This then shows the data in the files (obviously I’ve had to scrub out the sensitive data here):

image

You can then use the first row as the column headers:

image

Filter the data so that the row with the second set of column headers is removed (I wonder if there’s a way to do this automatically when importing multiple csv files?) by clicking on the Date and Time column and deselecting the value “Date and Time” as shown:

image

Right-click on each column you don't want to import (such as IP Address) and select Hide:

image

Right-click on the Date and Time column and select Remove Duplicates to remove any records that appear in both log files (I’m assuming that there were no cases where two people hit a page at exactly the same date and time, which of course may not be completely correct):

image

And force the Date and Time column to be treated as a Date/Time type:

image

And bingo, you’re done. Here are all the steps in the import, all of which can be edited, deleted, reordered etc:

image
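
For comparison, here's a rough, hand-written sketch of a query that does more or less the same thing (the folder path is a placeholder, I've left out hiding the unwanted columns, and the steps the Data Explorer UI actually generates will look a little different):

let
    Source = Folder.Files("C:\Users\Chris\Documents\Blog Logs"),
    //parse each csv's binary content and append the resulting tables into one
    ParsedCsvs = Table.Combine(Table.TransformColumns(Source, {"Content", Csv.Document})[Content]),
    PromotedHeaders = Table.PromoteHeaders(ParsedCsvs),
    //filter out the second set of column headers coming from the second file
    RemovedExtraHeaders = Table.SelectRows(PromotedHeaders, each [Date and Time] <> "Date and Time"),
    //remove records that appear in both log files
    RemovedDuplicates = Table.Distinct(RemovedExtraHeaders, {"Date and Time"}),
    ChangedType = Table.TransformColumnTypes(RemovedDuplicates, {{"Date and Time", type datetime}})
in
    ChangedType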

The data is then loaded into a table in a worksheet (though you can turn that off), and by clicking “Load to data model” in the Query Settings pane you can load the data into the Excel data model:

image

(NB I found some issues with loading date data into the data model and US/European date formats that I’ve reported here, but don’t forget this is beta software so there are bound to be problems like this)

You can build cool Power View reports using this data:

image

Or even explore it on a 3D map with GeoFlow:

image

Fun, isn’t it?

Written by Chris Webb

March 1, 2013 at 10:48 pm
