Chris Webb's BI Blog

Analysis Services, MDX, PowerPivot, DAX and anything BI-related

Join Conditions in Power Query, Part 2: Events-In-Progress, Performance and Query Folding

with 3 comments

In my last post you saw how to join two tables together using conditions other than the built-in inner, outer and anti join conditions. However, as I mentioned, you need to be aware of some of the performance implications of using this technique – and that is the subject I’ll be looking at in this post.

Let’s take the events-in-progress problem, one that I have blogged about many times in the past (see here, here, here and here for example) and see how we can solve it in Power Query. It’s a very common problem that you encounter when you have a fact table where each row represents an event, there are columns containing the start and end dates of each event, and you want to count the number of events that were in progress on any given date.

Using the Adventure Works DW database, my normal way of illustrating the problem is this: taking the DimDate and FactInternetSales tables, for each date show the number of purchases that had been ordered but not shipped on that date. One way of solving this problem (but not the only way, as my previous blog posts show) is to do a cross join between the DimDate and FactInternetSales tables, then filter the result so that you get the rows where the value in the DateKey column from the DimDate table is between the values in the OrderDateKey and the ShipDateKey columns from the FactInternetSales table.

As a first attempt you might come up with a query like this one, which uses a custom column containing a table value that is subsequently expanded to do a cross join (the technique shown in my previous post):

let

    //Connect to SQL Server

    Source = Sql.Database("localhost", "Adventure Works DW"),

    //Get data from the DimDate table

    dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],

    //Remove all columns except the DateKey column

    RemovedOtherColumns = Table.SelectColumns(dbo_DimDate,{"DateKey"}),

    //Insert a custom column that contains the whole of FactInternetSales as a table in each row

    InsertedCustom = Table.AddColumn(RemovedOtherColumns, "FactInternetSalesTable", 

                        each Source{[Schema="dbo",Item="FactInternetSales"]}[Data]),

    //Expand this new column to show the OrderDateKey and ShipDateKey columns

    #"Expand FactInternetSalesTable" = Table.ExpandTableColumn(InsertedCustom, 

                        "FactInternetSalesTable", 

                        {"OrderDateKey", "ShipDateKey"}, 

                        {"FactInternetSalesTable.OrderDateKey", 

                        "FactInternetSalesTable.ShipDateKey"}),

    //Filter where DateKey is greater than or equal to OrderDateKey and 

    //DateKey is less than or equal to ShipDateKey

    FilteredRows = Table.SelectRows(#"Expand FactInternetSalesTable", 

                        each [DateKey] >= [FactInternetSalesTable.OrderDateKey] and 

                        [DateKey] <= [FactInternetSalesTable.ShipDateKey]),

    //Find the count of the number of rows grouped by DateKey

    GroupedRows = Table.Group(FilteredRows, {"DateKey"}, 

                        {{"Count", each Table.RowCount(_), type number}})

in

    GroupedRows

There is, however, a big problem with this query: on my laptop it runs and runs forever – well, maybe not forever but I cancelled it after several minutes. Some tuning is necessary.

I don’t think anyone outside the Power Query dev team has much experience of performance tuning Power Query yet. However there is one golden rule that I do know: where possible, allow Power Query to push as much of the work back to the data source. This behaviour is known as “query folding” and it’s something that I’ve blogged about, as have Matt Masson and Darren Gosbell. Looking in SQL Server Profiler for the query above it is clear that no query folding is taking place: the only activity visible is Power Query reading the data from the DimDate and FactInternetSales tables separately.

After a bit of trial and error I came up with the following alternative:

let

    //Connect to SQL Server

    Source = Sql.Database("localhost", "adventure works dw"),

    //Get data from the DimDate table

    dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],

    //Remove all columns except DateKey

    RemovedOtherColumns = Table.SelectColumns(dbo_DimDate,{"DateKey"}),

    //Add a custom column to DimDate containing the value 1

    InsertedCustom = Table.AddColumn(RemovedOtherColumns, "Dummy", each 1),

    //Get data from the FactInternetSales table

    dbo_FactInternetSales = Source{[Schema="dbo",Item="FactInternetSales"]}[Data],

    //Remove all columns except OrderDateKey and ShipDateKey

    RemovedOtherColumns1 = Table.SelectColumns(dbo_FactInternetSales,

                        {"OrderDateKey", "ShipDateKey"}),

    //Add a custom column to FactInternetSales containing the value 1

    InsertedCustom1 = Table.AddColumn(RemovedOtherColumns1, "Dummy", each 1),

    //Join DimDate and FactInternetSales on the two columns that contain 1

    Custom1 = Table.Join(InsertedCustom1, "Dummy", InsertedCustom, "Dummy"),

    //Filter rows where DateKey is between OrderDateKey and ShipDateKey

    FilteredRows = Table.SelectRows(Custom1, 

                    each [DateKey] >= [OrderDateKey] and 

                    [DateKey] <= [ShipDateKey]),

    //Group by DateKey and find the number of rows for each date

    GroupedRows = Table.Group(FilteredRows, {"DateKey"}, 

                    {{"Count", each Table.RowCount(_), type number}}),

    //Sort dates in ascending order

    SortedRows = Table.Sort(GroupedRows,{{"DateKey", Order.Ascending}})

in

    SortedRows

This returns the correct result more or less instantly:

image

The main difference between this query and the previous one is how I’m getting a cross join between the two tables. This time I’m creating custom columns on DimDate and FactInternetSales that both contain the value 1, and then doing an inner join between this two tables on the new columns – which of course results in the equivalent of a cross join.

In Profiler I can see the following SQL query being generated by Power Query:

select [_].[DateKey],

    [_].[Count]

from 

(

    select [rows].[DateKey] as [DateKey],

        count(1) as [Count]

    from 

    (

        select [_].[OrderDateKey],

            [_].[ShipDateKey],

            [_].[Dummy],

            [_].[DateKey]

        from 

        (

            select [$Outer].[OrderDateKey],

                [$Outer].[ShipDateKey],

                [$Inner].[Dummy],

                [$Inner].[DateKey]

            from 

            (

                select [_].[OrderDateKey] as [OrderDateKey],

                    [_].[ShipDateKey] as [ShipDateKey],

                    1 as [Dummy]

                from 

                (

                    select [OrderDateKey],

                        [ShipDateKey]

                    from [dbo].[FactInternetSales] as [$Table]

                ) as [_]

            ) as [$Outer]

            inner join 

            (

                select [_].[DateKey] as [DateKey],

                    1 as [Dummy]

                from 

                (

                    select [DateKey]

                    from [dbo].[DimDate] as [$Table]

                ) as [_]

            ) as [$Inner] on ([$Outer].[Dummy] = [$Inner].[Dummy])

        ) as [_]

        where [_].[DateKey] >= [_].[OrderDateKey] and [_].[DateKey] <= [_].[ShipDateKey]

    ) as [rows]

    group by [DateKey]

) as [_]

order by [_].[DateKey]

 

Query folding is definitely taking place now!

The last question to ask here is whether the first query was slow because query folding was not taking place, or slow because of the way the query was written. You can test this quite easily by rewriting the second query to prevent query folding taking place using the Table.Buffer() function. For example, in the second query the step to get the data from the DimDate table is:

dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],

To prevent query folding it needs to be altered to:

dbo_DimDate = Table.Buffer(Source{[Schema="dbo",Item="DimDate"]}[Data]),

(The step to get data from the FactInternetSales table needs to be altered in the same way.)

With this change made the query now executes in around a minute. So clearly the new query is more efficient when it is executed inside the Power Query engine itself, without query folding, but the Power Query engine is still nowhere near as fast as SQL Server and query folding gives the best possible performance.

You can download the sample workbook here.

Written by Chris Webb

June 4, 2014 at 9:24 pm

Posted in Power Query

3 Responses

Subscribe to comments with RSS.

  1. Hi Chris,

    This is very good work. I’ve tried various techniques, including repeating rows in each table (Table.Repeat). However, the inner join is the cleanest and fastest method. On the TechNet forum, you mentioned that your solution is a bit of a hack. However, I’ll disagree on the following grounds:

    1) Your query is quite fast. I’ve pasted a native SQL query in Power Query (PQ) to return the same results as your inner join (I used Cross Join in the SQL query), and the performance is no better than your PQ query. This test suggests that even if cross join was supported natively in PQ, it’ll be no better in performance than the “hack” you’ve done.
    2) Packaged as a custom function, cross join can be used as if it were a native function, minimizing the need for built-in support. For example:
    Table_CrossJoin =
    (table1 as table,table2 as table, optional table2Name as nullable text) as table=>
    let
    Table1InsertedCustom = Table.AddColumn(table1,”JoinIndex”, each 1),
    Table2InsertedCustom = Table.AddColumn(table2,”JoinIndex”, each 1),
    Table2Name = if table2Name = null then “_” else table2Name,
    JoinedTables = Table.Join(Table1InsertedCustom,”JoinIndex”,Table.PrefixColumns(Table2InsertedCustom,Table2Name),Table2Name&”.JoinIndex”),
    RemovedColumns = Table.RemoveColumns(JoinedTables,{“JoinIndex”,Table2Name&”.JoinIndex”})
    in
    RemovedColumns

    And your query would look like:
    let
    Source = Sql.Database(“localhost”, “adventure works dw”),,
    dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
    DimDateRetainedColumns = Table.SelectColumns(dbo_DimDate,{“DateKey”}),
    dbo_FactInternetSales = Source{[Schema="dbo",Item="FactInternetSales"]}[Data],
    FactSalesRetainedColumns = Table.SelectColumns(dbo_FactInternetSales,
    {“OrderDateKey”, “ShipDateKey”}),
    JoinedTables = Table_CrossJoin(DimDateRetainedColumns,FactSalesRetainedColumns,”FactInternetSales”),
    FilteredRows = Table.SelectRows(JoinedTables,
    each [DateKey] >= [FactInternetSales.OrderDateKey] and
    [DateKey] <= [FactInternetSales.ShipDateKey]),
    GroupedRows = Table.Group(FilteredRows, {"DateKey"},
    {{"Number of Events", each Table.RowCount(_), type number}}),
    SortedRows = Table.Sort(GroupedRows,{{"DateKey", Order.Ascending}})
    in
    SortedRows

    There are probably good reasons that a crossjoin option was omitted from the PQ join functions. I don't believe that the target audience for PQ would know how to use this join properly. For example, if you have large tables, and you don't filter the table after the join, you run the risk of very long wait times and out of memory errors. I'm not sure if there is any other built-in function that could land you in this kind of trouble. In the AdventureWorks sample, the cross join produces 132,332,018 records if not filtered. AdventureWorks represents only a small dataset compared to real-world data!

    Colin Banfield

    July 10, 2014 at 3:31 pm

  2. Great Article! Very useful! Thank you!

    Uvarov Maxim

    July 22, 2014 at 12:57 am


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 3,240 other followers

%d bloggers like this: