Creating Histograms With Power Query

A few months ago someone at a conference asked me what the Power Query Table.Partition() function could be used for, and I had to admit I had no idea. However, when I thought about it, I realised one obvious use: for creating histograms! Now I know there are lots of other good ways to create histograms in Excel but here’s one more, and hopefully it will satisfy the curiosity of anyone else who is wondering about Table.Partition().

Let’s start with a table in Excel (called “Customers”) containing a list of names and ages:

image

Here’s the M code for the query to find the buckets:

let
    //Get data from Customers table
    Source = Excel.CurrentWorkbook(){[Name="Customers"]}[Content],
    //Get a list of all the values in the Age column
    Ages = Table.Column(Source,"Age"),
    //Find the maximum age
    MaxAge = List.Max(Ages),
    //The number of buckets is the max age divided by ten, then rounded up to the nearest integer
    NumberOfBuckets = Number.RoundUp(MaxAge/10),
    //Hash function to determine which bucket each customer goes into
    BucketHashFunction = (age) => Number.RoundDown(age/10),
    //Use Table.Partition() to split the table into multiple buckets
    CreateBuckets = Table.Partition(Source, "Age", NumberOfBuckets, BucketHashFunction),
    //Turn the resulting list into a table
    #"Table from List" = Table.FromList(CreateBuckets, Splitter.SplitByNothing(),
                           null, null, ExtraValues.Error),
    //Add a zero-based index column
    #"Added Index" = Table.AddIndexColumn(#"Table from List", "Index", 0, 1),
    //Calculate the name of each bucket
    #"Added Custom" = Table.AddColumn(#"Added Index", "Bucket",
                        each Number.ToText([Index]*10) & " to " & Number.ToText(([Index]+1)*10)),
    //Find the number of rows in each bucket - ie the count of customers
    #"Added Custom1" = Table.AddColumn(#"Added Custom", "Count", each Table.RowCount([Column1])),
    //Remove unnecessary columns
    #"Removed Columns" = Table.RemoveColumns(#"Added Custom1",{"Column1", "Index"})
in
    #"Removed Columns"

 

And here’s the output in Excel, with a bar chart:

image 

How does this work?

  • After loading the data from the Excel table in the Source step, the first problem is to determine how many buckets we’ll need. This is fairly straightforward: I use Table.Column() to get a list containing all of the values in the Age column, then use List.Max() to find the maximum age, then divide this number by ten and round up to the nearest integer.
  • Now for Table.Partition(). The first thing to understand about this function is what it returns: it takes a table and returns a list of tables (a list is something like an array), so you start with one table and end up with multiple tables. Each row from the original table will end up in one of the output tables.
  • One of the parameters that the Table.Partition() function needs is a hash function that determines which bucket table each row from the original table goes into. The BucketHashFunction step serves this purpose here: it takes a value, divides it by ten and rounds the result down. For example, pass in the age 88 and you get the value 8 back.
  • The CreateBuckets step calls Table.Partition() with the four parameters it needs: the table to partition, the name of the column to partition by, the number of buckets to create and the hash function. For each row in the original table, the customer’s age is passed to the hash function. The number that the hash function returns, taken modulo the number of buckets, is the index of the table in the list that the row is placed in. In the example above nine buckets are created, so Table.Partition() returns a list containing nine tables; for the age 8, the hash function returns 0 so the row is put in the table at index 0 in the list; for the age 88 the hash function returns 8, so the row is put in the table at index 8 in the list. The output of this step, the list of tables, looks like this:

    image
  • The next thing to do is to convert the list itself to a table, then add a custom column to show the names for each bucket. This is achieved by adding a zero-based index column and then using that index value to generate the required text in the step #"Added Custom".
  • Next, find the number of customers in each bucket. Remember that at this point the query still includes a column (called “Column1”) that contains a value of type table, so all that is needed is to create another custom column that calls Table.RowCount() for each bucket table, as seen in the step #"Added Custom1".
  • Finally I remove the columns that aren’t needed for the output table.
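
If you want to try the partitioning step on its own, here’s a minimal sketch that swaps the Customers table for a small inline table. The names and ages are made up for illustration, and the number of buckets is hard-coded to nine to match the example above.

```powerquery
let
    // Hypothetical sample data standing in for the Customers table
    Sample = #table({"Name", "Age"}, {{"Ann", 8}, {"Bob", 35}, {"Cat", 37}, {"Dan", 88}}),
    // Same hash function as in the main query
    BucketHashFunction = (age) => Number.RoundDown(age / 10),
    // Nine groups, as in the example above where the maximum age is 88
    Buckets = Table.Partition(Sample, "Age", 9, BucketHashFunction),
    // Replace each table in the list with its row count
    Counts = List.Transform(Buckets, Table.RowCount)
in
    Counts
```

With this sample the list should have nine items, with one row counted at index 0 (age 8), two at index 3 (ages 35 and 37) and one at index 8 (age 88); every other index holds an empty table and so a count of zero.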

I’m not convinced this is the most efficient solution for large data sets (I bet query folding stops very early on if you try this on a SQL Server data source) but it’s a good example of how Table.Partition() works. What other uses for it can you think of?
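
For comparison, one of the other ways to build the same histogram is to skip Table.Partition() altogether and group on a calculated bucket column. This is only a sketch of that alternative, not the approach used in this post, and the step names are my own. One difference to be aware of: it returns no row at all for an empty bucket, whereas the partitioned version returns a row with a count of zero.

```powerquery
let
    Source = Excel.CurrentWorkbook(){[Name="Customers"]}[Content],
    // Lower bound of each ten-year bucket, for example 88 -> 80
    AddedBucketStart = Table.AddColumn(Source, "BucketStart",
                         each Number.RoundDown([Age] / 10) * 10, Int64.Type),
    // Count the customers in each bucket
    Grouped = Table.Group(AddedBucketStart, {"BucketStart"},
                {{"Count", each Table.RowCount(_), Int64.Type}}),
    Sorted = Table.Sort(Grouped, {{"BucketStart", Order.Ascending}}),
    // Same label format as the main query
    AddedLabel = Table.AddColumn(Sorted, "Bucket",
                   each Number.ToText([BucketStart]) & " to " & Number.ToText([BucketStart] + 10),
                   type text),
    Result = Table.SelectColumns(AddedLabel, {"Bucket", "Count"})
in
    Result
```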

You can download the sample workbook here.

19 thoughts on “Creating Histograms With Power Query”

  1. Is it possible to use the Table.Partition() as the separation of business logic? I’m thinking in a big fact table, you need to perform certain logic before year 2000, some logic between 2000 and 2008 and other logic after 2008. Instead of writing a series of if statement, use the Table.Partition() to split the big fact table into different chunk and perform different logic (e.g. invoke functions). I’m not sure whether it is a good idea 🙂

    1. I do think it’s a good idea George…. 🙂
      By the way, thanks so much Chris for sharing your knowledge with us!
      And thanks to Microsoft as well for the exceptional BI tools they are implementing in Excel… I’m a HR manager in Brazil and people get really impressed with the dynamic reports I’ve been building using power query/ power pivot and Power view!! These tools are really changing my life and making people believe I’m in the wrong area hahaha…
      Regards, Daniel

      1. Hi Chris & George,

        Just wondering if you have any examples on how to do the following? “use the Table.Partition() to split the big fact table into different chunk and perform different logic (e.g. invoke functions)”

        I’ve been trying to explore how to perform different logic after the table has been partitioned, but really struggling.

        Thanks!
        Josh
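
A minimal sketch of the idea discussed in this thread, assuming a made-up fact table with Year and Amount columns: the hash function maps each year to one of three eras, and a list of placeholder functions (one per era, in the same order as the list that Table.Partition() returns) is applied to the matching partition before the results are combined again. The table, the column names and the three functions are all hypothetical.

```powerquery
let
    // Hypothetical fact table; the columns and values are made up
    Fact = #table({"Year", "Amount"}, {{1998, 100}, {2004, 200}, {2012, 300}}),
    // 0 = before 2000, 1 = 2000 to 2008, 2 = after 2008
    EraHash = (year) => if year < 2000 then 0 else if year <= 2008 then 1 else 2,
    Parts = Table.Partition(Fact, "Year", 3, EraHash),
    // One placeholder transformation per era, in the same order as Parts
    Logic = {
        (t as table) => Table.AddColumn(t, "Rule", each "pre-2000 logic"),
        (t as table) => Table.AddColumn(t, "Rule", each "2000-2008 logic"),
        (t as table) => Table.AddColumn(t, "Rule", each "post-2008 logic")
    },
    // Apply the matching function to each partition
    Processed = List.Transform(List.Positions(Parts), (i) => Logic{i}(Parts{i})),
    // Stitch the processed partitions back into a single table
    Result = Table.Combine(Processed)
in
    Result
```

In a real query each placeholder would be replaced by whatever function holds the logic for that period.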

  2. Hey Chris – I found your site via feedly so kudos to them for hooking up a killer app. Secondly it seems like the logical thing to do with table.partition. I would see even some use in this to group out salesman performance for a particular year based on the total dollar value sold. So your buckets would be total dollars/value sold over a predefined period and then your count would be the number of salesman for the company that fit into each of the categories of total value sold.

    Anyway beauty explanation of this. Thanks for sharing!

    Brad

  3. Reblogged this on BRAD EDGAR and commented:
    I thought I would share with you all an interesting read on how to use the table.partition function in Excel to create a histogram. The content is directly from Chris Webb’s blog and definitely is an awesome read if you are interested. I started to think about what other ways this data table grouping function could come in handy and thought I would look to you, the readers to see if you had any interesting ideas or inputs! Enjoy and make sure to tweet Chris’ content!

  4. Wow, that’s awesome. I was able to follow this and make a histogram for a project I’m working on. I do not know MDX so my question is with the buckets you made (0 to 10, 10 to 20, 20 to 30, etc.) is that “0 inclusive to 9 inclusive”, “10 inclusive to 19 inclusive”, “20 inclusive to 29 inclusive”, etc.??

  5. Hi Chris,

    One of the challenges I’m having with this is my data is very skewed, which means the buckets don’t group the data in a useful way. What I really need is a way to find the median value and then create 5 buckets either side of that, spaced out in a sensible way, perhaps using standard deviations.

    Thanks for pointing me in the right direction though, it’s proving to be quite a challenge!

    Nick

    1. Create a new query with the “Blank Query” option, then open the Advanced Editor window, delete everything there and paste in the code from this post.

  6. Table.Partition supports Query Folding, when wrapped in Table.Combine(). Exploring this a bit more now. The List.Max() creates multiple calls for each partition but that’s to be expected. Would really need to find a massive dataset to see if there are any gains.

    1. All kinds of interesting things are possible with Table.Partition, I think. I need to investigate whether it can be used to optimise memory-intensive operations – my post last week on measuring memory usage was preparing the ground for this kind of thing.

      1. Did you ever explore more interesting use cases with Table.Partition and Table.PartitionValues?

  7. Hi Chris,
    thanks for your amazing blog! It helped me in many issues.

    Currently I’m working on an issue, where I receive logging data from a machine: 130 parameters each second, so some log-files reach up to 3-digit MByte volumes. And each “process” we run on the machine creates its own log-file, sometimes up to 5 a day, sometimes a process might last for almost a full day…
    Using PQ I try to analyse this set of multi-GB-content, currently >2500 files.

    And, here is my point: each “process” consists of many steps, from 3 up to >250. Some of these steps last for hours (for instance, when the machine expects, within a step, some user interaction over night, but operators show up no earlier than 6am…).

    Now I want to know (as one of the questions, I want to get answered – beside some others…) whether or not the machine meets some set conditions during some of these steps. I guess, therefore the Table.Partition-function might be helpful to study each step and find out, whether or not this condition is met during the step or not.

    But, my question: up to now I can’t find any useful explanation regarding the corresponding function “Table.PartitionValues” – do you know something about it? Microsoft doesn’t tell too much, obviously just simply a copy-paste action of the Table.Partition-function help site.

    Additionally I need to learn, how to work with a ‘list of tables’…
    Again, thanks a lot!

  8. Dear Chris,
    I have tried your power query with other data and it seems to count 1 more point in the first bucket that should actually belong to the last bucket.
    I have got this error when having ages up to 110 in the data.
    Regards,
    Markus
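
The behaviour described here is consistent with the documented rule that Table.Partition() places each row at the index given by the hash value modulo the number of groups. With a maximum age of 110, Number.RoundUp(110/10) gives 11 groups, but the hash value for 110 is 11, and 11 modulo 11 is 0, so that row wraps round to the first bucket. The sketch below uses made-up ages to compare the original sizing with one possible fix, which is to size the list from the largest hash value plus one.

```powerquery
let
    // Hypothetical ages, including one that is an exact multiple of ten
    Sample = #table({"Name", "Age"}, {{"Ann", 8}, {"Bob", 35}, {"Cat", 110}}),
    BucketHashFunction = (age) => Number.RoundDown(age / 10),
    MaxAge = List.Max(Sample[Age]),
    // Original sizing: 11 groups, so hash value 11 wraps round to index 0
    OriginalCounts = List.Transform(
        Table.Partition(Sample, "Age", Number.RoundUp(MaxAge / 10), BucketHashFunction),
        Table.RowCount),
    // Adjusted sizing: largest hash value plus one, giving 12 groups (indexes 0 to 11)
    NumberOfBuckets = BucketHashFunction(MaxAge) + 1,
    FixedCounts = List.Transform(
        Table.Partition(Sample, "Age", NumberOfBuckets, BucketHashFunction),
        Table.RowCount)
in
    [Original = OriginalCounts, Fixed = FixedCounts]
```

With this sample, Original should show two rows at index 0 (ages 8 and 110), while Fixed should show one row at index 0 and one at index 11. The same arithmetic also answers the earlier question about bucket boundaries: because the hash rounds age/10 down, the bucket labelled “10 to 20” holds whole-number ages from 10 to 19 inclusive.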
