The partialBatch Commit Mode In The Power BI Enhanced Refresh API

I have always wondered what the partialBatch option for the commitMode parameter in the Enhanced Refresh API does exactly. There is some documentation here and here but I was curious to find out more as part of the research I’m doing for my ongoing series on Power BI refresh memory errors, in case it was useful for reducing overall memory usage (spoiler: it may be). In this post I’ll share what I found out after running some tests.

For my testing I created a semantic model with four tables called A, B, C and D. Each semantic model table used an M expression similar to the following as its source:

Function.InvokeAfter(
  () => #table(type table [MyCol = text], {{"A"}}),
  #duration(0, 0, 0, 10)
)

This expression – the expression for table A – returns a table with one column and one row after a delay of 10 seconds; the expressions for tables B, C and D were almost identical but had delays of 20, 30 and 40 seconds respectively. These delays ensured that when the tables refreshed they always finished in a set order if they were refreshed in parallel. There were no relationships or measures in the model.

I published this model and then created a notebook to refresh it using Semantic Link’s refresh_dataset method (hat tip to Phil Seamark, whose code I stole) which uses the Enhanced Refresh API behind the scenes:

import sempy.fabric as fabric
import time

WorkspaceName = "CW partialBatch Tests"
SemanticModelName = "partialBatchTest"

# run the refresh - refresh_dataset returns immediately with a request id
request_status_id = fabric.refresh_dataset(dataset=SemanticModelName, workspace=WorkspaceName, refresh_type="full", commit_mode="partialBatch", retry_count=1, max_parallelism=4)
print("Progress:", end="")

# poll until the refresh completes
while True:
    status = fabric.get_refresh_execution_details(SemanticModelName, request_status_id, WorkspaceName).status
    if status == "Completed":
        break
    print("░", end="")
    time.sleep(2)

print(": refresh complete")

Note that the commit_mode parameter is set to partialBatch, refresh_type is set to full and, at this point, max_parallelism is set to 4. Since the model contains four tables, this meant that all four tables could refresh in parallel.
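As an aside, since refresh_dataset is just a wrapper over the Enhanced Refresh API, these parameters map directly onto the JSON body of the underlying REST call. Here's a rough sketch of what the equivalent raw call could look like using Python's requests library – the workspace and semantic model IDs and the access token are placeholders I've added for illustration, so check the Enhanced Refresh API documentation for the full details:

import requests

# Placeholder values - not from the original post
workspace_id = "<workspace guid>"
dataset_id = "<semantic model guid>"
access_token = "<Azure AD access token with permission to refresh the model>"

url = (
    f"https://api.powerbi.com/v1.0/myorg/groups/{workspace_id}"
    f"/datasets/{dataset_id}/refreshes"
)

# These properties correspond to the refresh_type, commit_mode, retry_count and
# max_parallelism arguments passed to refresh_dataset above
body = {
    "type": "full",
    "commitMode": "partialBatch",
    "retryCount": 1,
    "maxParallelism": 4
}

response = requests.post(url, json=body, headers={"Authorization": f"Bearer {access_token}"})

# A 202 Accepted response means the refresh request has been queued; its progress
# can then be polled via the same API
print(response.status_code)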

I ran a Profiler trace while running this code to refresh the model and observed the following happening:

  • First, a refresh of type ClearValues (which clears all data from a table) was run in a single transaction for all four tables. See here for more details about each type of refresh.
  • Next a refresh of type DataOnly (which loads the data into a table but nothing more) was run in a single transaction for all four tables.
  • Finally, a refresh of type Calculate (which builds things like calculated tables, calculated columns, hierarchies and relationships) was run in a single transaction for all four tables. After this, all four tables were fully refreshed.

There are two important things to note about this:

  1. By running a refresh of type ClearValues first, to clear all the data out of the model, the memory consumption of the model is reduced to almost nothing. The downside of this is that the model is no longer queryable until the next two refreshes complete.
  2. Running two separate refreshes of type DataOnly and Calculate, rather than a single refresh of type Full, also reduces peak memory usage, although it will probably make the overall refresh slower.

For my second test I changed the definition of table C so that there was a 50% chance it returned an error when refreshed if an M parameter called FailRefresh was set to “Y”:

if FailRefresh = "Y" and Number.Random() > 0.5 then
  Function.InvokeAfter(
    () =>
      error Error.Record("Forced Refresh Failure"), 
    #duration(0, 0, 0, 30)
  )
else
  Function.InvokeAfter(
    () => #table(type table [MyCol = text], {{"C"}}), 
    #duration(0, 0, 0, 30)
  )

I then published the model again, changed my notebook code so that the retry_count parameter of refresh_dataset was set to 2 (see here for more detail about refresh retries), and ran the refresh until I got an instance where table C’s refresh failed the first time but succeeded on the retry. I observed the following:

  • First, a refresh of type ClearValues was run in a single transaction for all four tables as before.
  • Next, a refresh of type DataOnly was run in a single transaction for all four tables. Although the refreshes for tables A and B, which took 10 and 20 seconds, completed successfully the transaction failed because the refresh for table C returned an error after 30 seconds. The refresh for table D, which would have taken 40 seconds, did not complete.
  • Next, because retry_count was set to 2, the refresh of type DataOnly was run again for all four tables. This succeeded. Note that the ClearValues refresh was not rerun, and that tables A and B were refreshed all over again even though they had succeeded the first time.
  • Finally a refresh of type Calculate was run in a single transaction for all four tables, as before.

For my final test I kept everything the same except for setting the max_parallelism parameter to 1. I then refreshed until I once again found a case where table C’s refresh failed first time but succeeded on the retry. This time I observed the following:

  • First, a refresh of type ClearValues was run in a single transaction for all four tables as before except with a MaxParallelism of 1.
  • Next, a refresh of type DataOnly was run for just table A in its own transaction. It succeeded.
  • Next, a refresh of type DataOnly was run for just table B in its own transaction. It succeeded.
  • Next, a refresh of type DataOnly was run for just table C in its own transaction. It failed.
  • Next, a refresh of type DataOnly was run for just table C in its own transaction. It succeeded, and this represents the start of the retry.
  • Next, a refresh of type DataOnly was run for just table D in its own transaction. It succeeded.
  • Finally a refresh of type Calculate was run in a single transaction for all four tables, as before except with a MaxParallelism of 1.

This is interesting because, in contrast with the previous test, when table C failed and the retry started, tables A and B did not get refreshed a second time.

What can we learn from this?

  • The way the partialBatch commit mode splits up a refresh into three separate refreshes of type ClearValues, DataOnly and Calculate will reduce the overall peak memory consumption during a refresh, which is useful if your refresh is failing because you are using too much memory.
  • It will result in slower refresh performance overall and the model will not be queryable for most of the refresh, however.
  • If one of the tables in your model fails to refresh, then reducing the amount of parallelism and setting a retry_count greater than 0 may mean that the tables which did refresh successfully the first time around do not need to be refreshed again. Whether this results in a faster overall refresh depends on whether the time saved by not re-refreshing those tables outweighs the effect of reduced parallelism.

Overall I think using the partialBatch commit mode may be useful if you need to reduce memory usage during refresh or have very unreliable data sources. However, if you need this type of fine-grained control over your refreshes you're probably better off writing more complex code (for example using a Fabric notebook as I did here) to control exactly what gets refreshed and how, rather than just making a single call to the Enhanced Refresh API using this option.
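To give an idea of what that more complex code might look like, here's a rough sketch using Semantic Link that loads the data for each table separately and then does a single Calculate refresh at the end. I haven't tested this exact code; the objects parameter and the refresh_type values are based on what the Enhanced Refresh API documents, so treat it as a starting point rather than a finished solution:

import time
import sempy.fabric as fabric

WorkspaceName = "CW partialBatch Tests"
SemanticModelName = "partialBatchTest"
tables = ["A", "B", "C", "D"]

def refresh_and_wait(**kwargs):
    # refresh_dataset returns immediately with a request id, so poll until it finishes
    request_id = fabric.refresh_dataset(
        dataset=SemanticModelName, workspace=WorkspaceName, **kwargs
    )
    while True:
        status = fabric.get_refresh_execution_details(
            SemanticModelName, request_id, WorkspaceName
        ).status
        if status in ("Completed", "Failed"):
            return status
        time.sleep(5)

# Load the data for one table at a time so that a failure only affects that table
for table in tables:
    print(table, refresh_and_wait(refresh_type="dataOnly",
                                  objects=[{"table": table}], retry_count=2))

# Rebuild calculated columns/tables, hierarchies and relationships in one go at the end
print("calculate", refresh_and_wait(refresh_type="calculate"))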

[Thanks to Jast Lu for answering my questions about this topic]

Power BI Semantic Model Memory Errors, Part 2: Max Offline Semantic Model Size

In the Power BI Service, Import mode models are stored offline and paged into memory only when they are needed – for example because someone runs a report that uses the model as its source. As discussed in my last post, though, there’s a limit on the amount of memory that a model can use which varies by the size and type of capacity you’re using. There’s also an extra complication in that Import mode models are stored offline in a compressed format, which means the Power BI Service doesn’t know exactly how much memory will be needed if the entire model has to be held in memory. As a result there is an additional limit enforced on the size of the offline copy of Import mode models to ensure they don’t use too much memory when they are eventually paged in.

This limit can be configured in the Admin Portal by setting the Max Offline Semantic Model Size property (still shown as the Max Offline Dataset Size in the UI at the time of writing):

The default value of this property is 0, which means that the limit will be set to the maximum allowed value for the capacity SKU you’re using. Different maximum limits are enforced depending on whether you have turned on the Large semantic model storage format option for your model. The maximum limits for the Large model format are the same as the limits on the maximum amount of memory that a model can use, as listed in the table here; the limits for the Small model format are published here. You can also set the property to a value that is lower than the allowed limit if you want to control the size of the models that your developers can publish – smaller models generally use fewer CUs during refresh or querying, which could help reduce the load on your capacity.

To test what happens when you exceed this limit I created a new F16 capacity (which has a maximum offline model size of 5GB) and created a workspace on it containing an Import mode Power BI semantic model, using the Large model format, that was 3.3GB in size. I then scaled the capacity down to an F2 and tried running a DAX query against the model. The Max Offline Semantic Model Size limit for an F2 capacity for models with the Large model format enabled is 3GB, which is less than the size of the model I had created. As a result the query returned the following error message:

Database ‘xyz’ exceeds the maximum size limit on disk; the size of the database to be loaded or committed is 3500853108 bytes, and the valid size limit is 3221225472 bytes. If using Power BI Premium, the maximum DB size is defined by the customer SKU size (hard limit) and the max dataset size from the Capacity Settings page in the Power BI Portal.

If you’re getting this error then you need to reduce the size of your model, for example by reducing the number of rows in your tables, removing unused columns, or reducing the number of distinct values in the columns you keep. You can find more suggestions on how to reduce model size here.

[Thanks to Akshai Mirchandani for answering my questions on this topic]

Power BI Semantic Model Memory Errors, Part 1: Model Size

You probably know that there is a limit on the amount of memory that a semantic model in Power BI can use. This is true of all types of semantic model – Import, Direct Lake and DirectQuery – although it’s not something you usually need to worry about in DirectQuery mode. The limit depends on whether you’re using Shared (aka Pro) or a Premium/Fabric capacity and, if you’re using a capacity, how large that capacity is. In Shared/Pro the maximum amount of memory that a semantic model can use is 1GB; if you are using a capacity then the amount of memory available for models in each SKU is documented in the Max Memory column of the table here:

What counts as “memory usage” though? More importantly, how can you breach this limit and what do all of the different memory-related error messages that you might see mean? In this series I will try to answer these questions, and in this post I will look at one particular error you see when your model needs to use more memory than it is allowed to.

First of all it’s important to understand that the amount of memory used by a semantic model is not the same as the amount of data “in” the model. The diagram below shows how model memory usage can be broken down. The data in the columns and tables of your model, along with supporting objects like relationships (represented by the blue box in the diagram below) makes up just one part of the overall model memory usage. In addition, more memory is needed to store data associated with row-level security, user sessions, caches and so on (represented by the orange box in the diagram below).

Both Import mode and Direct Lake models can page data in and out of memory as required, so the whole model may not be in memory at any given time. However, in order for a query to run, the data it needs must be in memory and cannot be paged out until the query has finished with it. Therefore out of all the memory consumed by a semantic model, at any given time, some of that memory is “evictable” because it isn’t in use while some of it is “non-evictable” because it is being used. Evictable memory may be paged out of memory for a variety of reasons, for example because the model is nearing its allowed memory limit.

One further factor to take into account is the memory used by queries that are running on the model (the purple boxes in the diagram above). While each query has a limit on the amount of memory it can use – I mentioned the Query Memory Limit in this post but I will revisit it later on in this series – the total amount of memory used by queries also contributes to the overall memory use of a semantic model. If you have a large number of concurrent queries running, even if no single query uses much memory, this can contribute a lot to the overall memory usage of your model.

In summary then, the total amount of memory used by a semantic model is made up of three groups:

  1. The data in the tables in your model (the blue box above)
  2. Supporting data for RLS security roles, sessions and caches (the orange box above)
  3. Data used by queries (the purple boxes above)

When the sum of these three groups exceeds the total amount of memory allowed for your model, and no data can be evicted from memory to reduce this sum, then you’ll get an error.

To illustrate this I created a new F2 capacity, which has a 3GB limit on the amount of memory used by a semantic model, loaded a table (called SourceData) with 3.5 million rows of random numbers stored as text into a Lakehouse, then created a new custom Direct Lake semantic model on it. I set the Direct Lake Behavior property on the model to “Direct Lake only” to prevent fallback to DirectQuery mode.
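The original post doesn’t include the code used to generate this table, but here’s a rough sketch of how a table like SourceData could be created in a Fabric notebook attached to a Lakehouse – the column layout, names and row width below are illustrative assumptions, not the author’s actual code:

from pyspark.sql import functions as F

row_count = 3_500_000
random_text_columns = 8  # enough high-cardinality text to make the model large

df = spark.range(row_count)
for i in range(random_text_columns):
    # random numbers stored as text compress badly, which inflates the model size
    df = df.withColumn(
        f"RandomText{i}",
        (F.rand(seed=i) * 1_000_000_000_000).cast("long").cast("string")
    )

# Save as a Delta table called SourceData in the attached Lakehouse
df.drop("id").write.mode("overwrite").format("delta").saveAsTable("SourceData")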

After creating the model I used DAX Studio’s Model Metrics feature with the “Read statistics from data” option turned off to find the amount of data stored in memory (ie the blue box value).

Unsurprisingly, at this stage, the size of the model was very small: only 8KB.

I then turned the “Read statistics from data” option on, knowing that this would force data to be paged into memory. This showed the total potential size of the model to be 4.25GB:

I was initially confused by this because 4.25GB is already well over the 3GB limit, but it was pointed out to me that what is probably happening is that DAX Studio runs a number of DMV queries to get the data needed to calculate this value, and as this happens different parts of the model are paged in and out of memory. DAX Studio was certainly very slow to calculate the Model Metrics when I did this, which fits with the paging in/out theory.
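As an aside, if you want similar column-level size information from a Fabric notebook rather than DAX Studio, Semantic Link can help. The sketch below assumes that fabric.list_columns with extended=True returns per-column storage statistics including a “Total Size” column in bytes – the exact column names may differ between versions, and, just as with DAX Studio, running it may itself cause data to be paged into memory:

import sempy.fabric as fabric

# Illustrative names, not from the original post
WorkspaceName = "CW Memory Tests"
SemanticModelName = "MemoryTest"

# extended=True asks Semantic Link to include storage statistics for each column,
# gathered from the same kind of DMVs that DAX Studio/Vertipaq Analyzer use
columns = fabric.list_columns(
    dataset=SemanticModelName,
    workspace=WorkspaceName,
    extended=True
)

# Assumes the extended output contains a "Total Size" column in bytes
top_columns = columns.sort_values("Total Size", ascending=False)
print(top_columns[["Table Name", "Column Name", "Total Size"]].head(10))
print("Approximate size of column data in bytes:", columns["Total Size"].sum())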

Finally, I ran a simple DAX query to get the top 10 rows from the SourceData table:

EVALUATE TOPN(10, SourceData)

This query ran for about ten seconds and then failed with the following error message:

Resource Governing: We cannot complete the requested operation because there isn’t enough memory (consumed memory 4620 MB, memory limit 3072 MB). Either reduce the size of your dataset, such as by limiting the amount of in-memory data, or host the dataset on a Fabric or Premium capacity with a sufficient memory size. See https://go.microsoft.com/fwlink/?linkid=2159753 to learn more.

[The error code associated with this message is 0xC13E0006 or -1052901370]

This is the error that you get when your model needs to use more memory than it is allowed to use for the capacity SKU it is running on. The query references every column from the only table in the model, which means the whole table – which is the whole model – would have to be paged into memory for the query to run, but the whole model requires more memory than is available on an F2 capacity.

If you aren’t getting this exact error message then something slightly different might be happening. In future posts in this series I will look at some of these other errors including the query memory limit and the command memory limit.

[Thanks to Marius Dumitru for the information in this post]

Power BI/Data Books Roundup

It’s time for another short post on the free books that various authors have been kind enough to send me over the last few months. Full disclosure: these aren’t reviews as such, they’re more like free publicity in return for the free books, and I don’t pretend to be unbiased; also, the Amazon UK links have an affiliate code in them that gives me a kickback if you buy any of these books.

Deciphering Data Architectures, James Serra

I’ll be honest, I’ve had this book hanging around in my inbox since February and I wasn’t sure what to expect of it, but when I finally got round to reading it I enjoyed it a lot and found it very useful. If you’re looking for clear, concise explanations of all of the jargon and methodologies that are in use in the data industry today then this is the book for you. Do you want to understand the difference between Kimball and Inmon? Get an honest overview of data mesh? Choose between a data lake and a relational data warehouse? It’s all here and more. It’s an opinionated book (which I appreciate) and quite funny in places too. Definitely a book for every junior BI consultant to read and for more senior people to have handy to fill in gaps in their knowledge.

Extending Power BI with Python and R (second edition), Luca Zavarella

I posted about the first edition of this book back in 2021; this new edition has several new chapters about optimising R and Python settings, using Intel’s Math Kernel library for performance and addressing integration challenges. As before this is all fascinating stuff that no-one else in the Power BI world is talking about. I feel like a future third edition covering what will be possible with Power BI and Python in Fabric in 2-3 years will be really cool.

Data Cleaning with Power BI, Gus Frazer

It’s always nice to see authors focusing on a business problem – in this case data cleaning – rather than a technology. If you’re looking for an introductory book on Power Query this certainly does the job but the real value here is the way it looks at how to clean data for Power BI using all of the functionality in Power BI, not just Power Query, as well as tools like Power Automate. It’s also good at telling you what you should be doing with these tools and why. Extra credit is awarded for including a chapter that covers Azure OpenAI and Copilot in Dataflows Gen2.

New Semi Join, Anti Join And Query Folding Functionality In Power Query

There are a couple of nice new features to do with table joins (or merges as they are known in M) and query folding in Power Query in the April release of Power BI Desktop that I want to highlight.

Anti Joins now fold

First of all, a few months ago I wrote a post about how the built-in anti join functionality didn’t fold in Power Query. The good news is that it now does on SQL Server-related sources, so no more workarounds are needed. For example, if you have two tables in a SQL Server database called Fruit1 and Fruit2 and two Power Query queries that get data from those tables:

…then the following M code:

let
  Source = Table.Join(
    Fruit1,
    {"Fruit1"},
    Fruit2,
    {"Fruit2"},
    JoinKind.LeftAnti
  )
in
  Source

…returns the following table of fruits that are in the Fruit1 table and not in the Fruit2 table:

Of course that’s what the code above returned in previous versions of Power Query too. The difference now is that query folding occurs and the following SQL code is generated:

select [$Outer].[Fruit1],
    cast(null as nvarchar(50)) as [Fruit2]
from 
(
    select [_].[Fruit] as [Fruit1]
    from [dbo].[Fruit1] as [_]
) as [$Outer]
where not exists 
(
    select 1
    from 
    (
        select [_].[Fruit] as [Fruit2]
        from [dbo].[Fruit2] as [_]
    ) as [$Inner]
    where [$Outer].[Fruit1] = [$Inner].[Fruit2] or [$Outer].[Fruit1] is null and [$Inner].[Fruit2] is null
)

New join kinds: semi joins

There are also two brand new join kinds you can use in the Table.Join and Table.NestedJoin functions: JoinKind.LeftSemi and JoinKind.RightSemi. Semi joins allow you to select the rows in one table that have matching values in another table. Using the Fruit1 and Fruit2 tables above, the following M code:

let
  Source = Table.Join(
    Fruit1, 
    {"Fruit1"}, 
    Fruit2, 
    {"Fruit2"}, 
    JoinKind.LeftSemi
  )
in
  Source

…returns all the rows in Fruit1 where there is a matching value in Fruit2:

Here’s the SQL that is generated:

select [$Outer].[Fruit1],
    cast(null as nvarchar(50)) as [Fruit2]
from 
(
    select [_].[Fruit] as [Fruit1]
    from [dbo].[Fruit1] as [_]
) as [$Outer]
where exists 
(
    select 1
    from 
    (
        select [_].[Fruit] as [Fruit2]
        from [dbo].[Fruit2] as [_]
    ) as [$Inner]
    where [$Outer].[Fruit1] = [$Inner].[Fruit2] or [$Outer].[Fruit1] is null and [$Inner].[Fruit2] is null
)

The ?? operator now folds

The M language’s ?? coalesce operator is used for replacing null values and it now folds on SQL Server-related sources too. For example, the M query in the previous section that did a semi join on Fruit1 and Fruit2 returns a table where all the rows in the Fruit2 column contain null values. The following M query adds a new custom column that returns the text value “Nothing” when the Fruit2 column contains a null:

let
  Source = Table.Join(
    Fruit1, 
    {"Fruit1"}, 
    Fruit2, 
    {"Fruit2"}, 
    JoinKind.LeftSemi
  ), 
  ReplaceNulls = Table.AddColumn(
    Source, 
    "NullReplacement", 
    each [Fruit2] ?? "Nothing"
  )
in
  ReplaceNulls

Here’s the SQL generated for this, where the ?? operator is folded to a CASE statement:

select [_].[Fruit1] as [Fruit1],
    [_].[Fruit2] as [Fruit2],
    case
        when [_].[Fruit2] is null
        then 'Nothing'
        else [_].[Fruit2]
    end as [NullReplacement]
from 
(
    select [$Outer].[Fruit1],
        cast(null as nvarchar(50)) as [Fruit2]
    from 
    (
        select [_].[Fruit] as [Fruit1]
        from [dbo].[Fruit1] as [_]
    ) as [$Outer]
    where exists 
    (
        select 1
        from 
        (
            select [_].[Fruit] as [Fruit2]
            from [dbo].[Fruit2] as [_]
        ) as [$Inner]
        where [$Outer].[Fruit1] = [$Inner].[Fruit2] or [$Outer].[Fruit1] is null and [$Inner].[Fruit2] is null
    )
) as [_]

[Thanks to Curt Hagenlocher for the information in this post]

Displaying Azure Maps In A Power BI Paginated Report

The built-in mapping functionality in Power BI paginated reports is fairly basic. However the integration of Power Query into Power BI paginated reports gives you an interesting new way of creating maps in paginated reports: you can call the Azure Maps API using Power Query and display the image returned in an Image report item. In this blog post I’ll show you how.

Here’s a quick summary of what I’m going to do:

  • Call the API from https://data.police.uk/ (specifically the Crimes At Location endpoint) using Power Query to get all the recorded crimes within a one mile radius of a given latitude and longitude in a given month for any location in England, Wales or Northern Ireland
  • Take this list of crimes and pass them to the Azure Maps API Get Map Static Image endpoint to return an image of a map with the crime locations on it
  • Display this image in an Image report part in a paginated report

And here’s an example of what the final paginated report will look like:

Step 1: Sign up for the Azure Maps API

In order to call the Azure Maps API you’ll need to go to the Azure Portal and create a resource. The pricing is very reasonable: the first 1000 calls to the endpoint used here are free and after that it’s $4.50 per month for up to 500,000 calls, which should be more than enough for BI purposes.

Step 2: Create Shareable Cloud Connections

To connect to data sources in Power Query in paginated reports you need to create Shareable Cloud Connections in the Power BI portal. You’ll need two connections for this report: one for the Azure Maps API with the URL https://atlas.microsoft.com/map/static/png and one for the Crime API with the URL https://data.police.uk/api/crimes-street/all-crime. Both SCCs should have the authentication method Anonymous and the privacy level Public and have the Skip Test Connection option checked:

Step 3: Create a paginated report and Power Query query to call APIs

After creating a new paginated report in Power BI Report Builder you need to create a dataset (called AzureMap here) to get data from the APIs. This dataset uses Power Query as a source and has one main query (also called AzureMap) and four parameters:

  • lon and lat, to hold the latitude and longitude of the location to get crime data for, which will also be the centre point of the map
  • zoom, which is the zoom level of the map
  • yearmonth, which is the year and month in YYYY-MM format to get crime data for:

Here’s the M code for the query:

let
  CallCrimeAPI = Json.Document(
    Web.Contents(
      "https://data.police.uk/api/crimes-street/all-crime",
      [
        Query = [
          lat  = Text.From(lat),
          lng  = Text.From(lon),
          date = yearmonth
        ]
      ]
    )
  ),
  ToTable = Table.FromList(
    CallCrimeAPI,
    Splitter.SplitByNothing(),
    null,
    null,
    ExtraValues.Error
  ),
  First50 = Table.FirstN(ToTable, 50),
  ExpandColumn1 = Table.ExpandRecordColumn(
    First50,
    "Column1",
    {"location"},
    {"location"}
  ),
  Expandlocation = Table.ExpandRecordColumn(
    ExpandColumn1,
    "location",
    {"latitude", "street", "longitude"},
    {
      "location.latitude",
      "location.street",
      "location.longitude"
    }
  ),
  JustLatLon = Table.SelectColumns(
    Expandlocation,
    {"location.longitude", "location.latitude"}
  ),
  TypeToText = Table.TransformColumnTypes(
    JustLatLon,
    {
      {"location.longitude", type text},
      {"location.latitude", type text}
    }
  ),
  MergedColumns = Table.CombineColumns(
    TypeToText,
    {"location.longitude", "location.latitude"},
    Combiner.CombineTextByDelimiter(
      " ",
      QuoteStyle.None
    ),
    "LongLat"
  ),
  PrefixPipe = Table.TransformColumns(
    MergedColumns,
    {{"LongLat", each "|" & _, type text}}
  ),
  GetString = "|"
    & Text.Combine(PrefixPipe[LongLat]),
  QueryRecord = [
    #"subscription-key"
      = "InsertYourSubscriptionKeyHere",
    #"api-version" = "2022-08-01",
    layer = "basic",
    style = "main",
    #"zoom" = Text.From(zoom),
    center = Text.From(lon) & ", " & Text.From(lat),
    width = "768",
    height = "768"
  ],
  AddPins = try
    Record.AddField(
      QueryRecord,
      "pins",
      "default|sc0.5" & GetString
    )
  otherwise
    QueryRecord,
  CallAzureMapsAPI = Web.Contents(
    "https://atlas.microsoft.com/map/static/png",
    [Query = AddPins]
  ),
  ToText = Binary.ToText(
    CallAzureMapsAPI,
    BinaryEncoding.Base64
  ),
  OutputTable = #table(
    type table [image = text],
    {{ToText}}
  )
in
  OutputTable

You need to put all this code in a single M query to avoid the “Formula.Firewall: Query ‘Query1’ (step ‘xyz’) references other queries or steps, so it may not directly access a data source. Please rebuild this data combination.” error. You can find out more about this error by watching my data privacy video here.

A few things to note:

  • The CallCrimeAPI step calls the Get Crimes At Location API endpoint to get all the reported crimes within a one mile radius of the given latitude and longitude in the given year and month.
  • Because of the way I’m sending the crime location data to the Azure Maps API I limited the number of locations to 50, in the First50 step, to avoid hitting errors relating to the maximum length of a URL.
  • The GetString step returns a pipe delimited list of longitudes and latitudes of crime locations for the Azure Maps API to display as pins on the map. However, some error handling is needed in case there were no reported crimes in the given location or month and that happens in the AddPins step.
  • The QueryRecord step contains all the parameters to send to the Azure Maps Get Map Static Image endpoint. This docs page has more information on what’s possible with this API – I’m barely scratching the surface of what’s possible in this example.
  • Authentication to the Azure Maps API is via a subscription key which you’ll need to pass to the subscription-key parameter. You can get the key from the resource created in step 1 in the Azure Portal.
  • The API returns an image binary which is converted to text and returned in a table with one column and one row in the ToText and OutputTable steps. The code is similar to what I showed in this blog post but luckily I didn’t seem to need to break it up into multiple rows.

Step 4: Create Power Query query to return values for Zoom parameter

The Zoom parameter of the Get Map Static Image API endpoint accepts a value between 0 and 20, which represents the zoom level of the displayed map. You need to create a separate dataset and M query to return a table containing those values with the following code:

let
  Source = {0 .. 20}, 
  #"Converted to table" = Table.FromList(
    Source, 
    Splitter.SplitByNothing(), 
    null, 
    null, 
    ExtraValues.Error
  ), 
  #"Changed column type"
    = Table.TransformColumnTypes(
    #"Converted to table", 
    {{"Column1", Int64.Type}}
  ), 
  #"Renamed columns" = Table.RenameColumns(
    #"Changed column type", 
    {{"Column1", "Zoom"}}
  )
in
  #"Renamed columns"

Step 5: Create paginated report parameters

Next you need to create four parameters in the paginated report for the longitude, latitude, zoom level and year month:

To make it easy for end users to select a zoom level, you need to bind the available values for the zoom parameter to the table returned by the dataset from the previous step:

Step 6: Display the map in an Image report part

In the paginated report itself the only interesting thing is the configuration of the Image report part in the centre of the report:

You need to set the image source to “Database”, bind it to the following expression

=First(Fields!image.Value, "AzureMap")

…which gets the text value from the sole row and column in the table returned by the AzureMap dataset created in step 3, and set the MIME type to be “image/png”.

And that’s it! After publishing you can enter any latitude and longitude in England, Wales or Northern Ireland, a year and month, and a zoom level, and get all the reported crimes on a map:

You can download the .rdl file with the paginated report here (remember to edit the AzureMap query to insert your own Azure Maps API subscription key).

Power BI Paginated Reports That Connect To Web Services And Excel

By far the most exciting announcement for me this week was the new release of Power BI Report Builder that has Power Query built in, allowing you to connect to far more data sources in paginated reports than you ever could before. There’s a very detailed blog post and video showing you how this new functionality works here:

https://powerbi.microsoft.com/en-us/blog/get-data-with-power-query-available-in-power-bi-report-builder-preview

The main justification for building this feature was to allow customers to build paginated reports on sources like Snowflake or BigQuery, something which had only been possible before if you used an ODBC connection via a gateway or built a semantic model in between – neither of which is an ideal solution. However it also opens up a lot of other possibilities too.

For example, you can now build paginated reports on web services (with some limitations). I frequently get asked about building regular Power BI reports that get data from web services on demand – something which isn’t possible, as I explained here. To test using paginated reports on a web service I registered for Transport for London’s APIs and built a simple report on top of their Journey Planner API (Transport for London are the organisation that manages public transport in London). This report allows you to enter a journey starting point and ending point anywhere in or around London, calls the API and returns a table with different routes from the start to the destination, along with timings and instructions for each route. Here’s the report showing different routes for a journey from 10 Downing Street in London to Buckingham Palace:

You can also build paginated reports that connect to Excel workbooks that are stored in OneDrive or OneLake, meaning that changes made in the Excel workbook show up in the report as soon as the workbook is saved and closed:

So. Much. Fun. I’ll probably develop a presentation for user groups explaining how I built these reports soon.

And yes, if you need to export data to Excel on a schedule, paginated reports are now an even better choice. You know your users want this.

Best Practices For Power BI On Databricks Webinar

I recently took part in a webinar with Denny Lee, Liping Huang and Marius Panga from Databricks on the subject of best practices for using Power BI on Databricks. You can view the recording on LinkedIn here:

https://www.linkedin.com/video/live/urn:li:ugcPost:7174102151407779841

…or on YouTube here:

My section at the beginning covering Power BI best practices for Import and DirectQuery doesn’t contain any new information – if you’ve been following the DirectQuery posts on this blog or read the DirectQuery guidance docs here and here then there won’t be any surprises. What I thought was really useful, though, was hearing the folks from Databricks talk about best practices on the Databricks side and this took up the majority of the webinar. Definitely worth checking out.

How Much Does Copilot Cost In Microsoft Fabric?

There’s a lot of excitement about Copilot in Power BI and in Fabric as a whole. The number one question I’m asked about Copilot by customers is “How much does it cost?” and indeed there have been two posts on the Fabric blog here and here attempting to answer this question. The point I want to make in this post, though, is that it’s the wrong question to ask.

Why? Well to start off with Copilot isn’t something you buy separately. Every time you use Copilot in Fabric it uses compute from a capacity, either a P SKU or an F SKU, just the same as if you opened a report or refreshed a semantic model. You buy the capacity and that gives you a pool of compute that you can use however you want, and using Copilot is just one of the things you can use that compute for. No-one ever asked me how much it cost in dollars to open a report in Power BI or refresh a semantic model, so why ask this question about Copilot?

Of course I understand why Copilot feels like a special case: customers know a lot of users will want to play with it and they also know how greedy AI is for resources. Which brings me back to the point of this post: the question you need to ask about Copilot is “How much of my Power BI/Fabric capacity’s compute will be used by Copilot if I turn it on?”. Answering this question in terms of percentages is useful because if you consistently go over 100% usage on a Fabric capacity then you will either need to buy more capacity or experience throttling. And if Copilot does end up using a lot of compute, and you don’t want to buy more capacity or deal with throttling, then maybe you do want to limit its usage to a subset of your users or even turn it off completely?

To a certain extent the question about percentage usage can be answered with the Premium Capacity Metrics app. For example, I opened up a gen2 dataflow on a Premium capacity that returns the following data:

I expanded the Copilot pane and typed the following prompt:

Filter the table so I only have products that were produced in the country France

And here’s what I got back – a filtered table showing the row with the Producer Country France:

So what impact did this have on my capacity? Since I know this was the only time I used Copilot the day I wrote this post it was easy to find the relevant line in the “Background operations for timerange” table on the TimePoint Detail page of the Premium Capacity Metrics app:

There are two important things to note:

  • For the 30-second window I selected the operation above used 0.02% of my capacity
  • Copilot operations are classed as “background operations” so their cost is smoothed out over 24 hours. Therefore the operation above used 0.02% of my capacity for 24 hours from a few minutes after the point I hit enter and the operation ran; in this particular case I ran my test just before 16:20 on Friday March 15th and the operation used 0.02% of my capacity from 16:20 on Friday the 15th until 16:20 on Saturday March 16th.

How can you extrapolate tests like this to understand the likely load on your capacity in the real world? With great difficulty. Almost all the Fabric/Power BI Copilot experiences available today are targeted at people developing rather than just consuming, so that naturally limits the opportunities people have to use Copilot. Different prompts will come with different costs (as the blog posts mentioned above explain), a single user will want to use Copilot more than once in a day and you’ll have more than one user wanting to use Copilot. What’s more, going forward there will be more and more opportunities to use Copilot in different scenarios and as Copilot gets better and better your users will want to use it more. The compute usage of different Fabric Copilots may change in the future too. So your mileage will vary a lot.

Is it possible to make a rough estimate? Let’s say you have 20 developers (which I think is reasonable for a P1/F64 – how many actively developed solutions are you likely to have on any given capacity?) writing 20 prompts per day. If each prompt uses 0.02% of your capacity then 20*20*0.02% = 8%, so a maximum of 8% of your capacity would be used by Copilot over the whole day. That’s not inconsiderable and I’m sure someone will leave a comment saying I’m underestimating what usage will be.
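If you want to play with those assumptions, the same back-of-the-envelope calculation is easy to parameterise – the numbers below are just the ones from my test and estimate above, so substitute your own:

# Back-of-the-envelope estimate of Copilot's share of a capacity's compute
developers = 20
prompts_per_developer_per_day = 20
capacity_pct_per_prompt = 0.02  # % of the capacity per prompt, from the test above

copilot_pct_per_day = developers * prompts_per_developer_per_day * capacity_pct_per_prompt
print(f"Estimated Copilot usage: {copilot_pct_per_day}% of the capacity per day")  # 8.0%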

Which brings me to my last point: should you even see Copilot as being different from anything else you can do in Fabric that consumes compute? Or as an optional extra or additional cost? After all, dataflows consume compute, you can enable or disable dataflows in your tenant in the same way you can enable or disable Copilot, but very few customers disable dataflows because they see the benefit of using them. Turning off dataflows would reduce the load on your capacity but it would also stop your users from being so productive, and why would you do that? If we at Microsoft deliver on the promise of Copilot (and believe me, we’re working hard on this) then the productivity gains it brings to your developers should offset the cost of any extra capacity you need to buy – if indeed you need to buy any extra capacity.

So, to sum up, if you enable Copilot in Fabric you will see additional load on your capacities and you may need to buy more capacity as a result – but the benefits will be worth it. Predicting that additional load is difficult but it’s no different from predicting how your overall Fabric capacity usage will grow over time, as more and more reports, semantic models, notebooks, warehouses and so on get created and used. Rather than doing lots of complex calculations based on vague assumptions to try to predict that load, my advice is that you should use the Capacity Metrics app to monitor your actual usage and buy that extra capacity when you see you’re going to need it.

Overhead Of Getting Relationship Columns In Power BI DirectQuery Mode

Many Power BI connectors for relational databases, such as the SQL Server connector, have an advanced option to control whether relationship columns are returned or not. By default this option is on. Returning these relationship columns adds a small overhead to the time taken to open a connection to a data source and so, for Power BI DirectQuery semantic models, turning this option off can improve report performance slightly.

What are relationship columns? If you connect to the DimDate table in the Adventure Works DW 2017 sample database using Power Query, you’ll see them on the right-hand side of the table. The following M code:

let
  Source = Sql.Database("localhost", "AdventureWorksDW2017"),
  dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data]
in
  dbo_DimDate

…shows the relationship columns:

Whereas if you explicitly turn off the relationships by deselecting the “Including relationship columns” checkbox:

…you get the following M code with the CreateNavigationProperties property set to false:

let
  Source = Sql.Database("localhost", "AdventureWorksDW2017", [CreateNavigationProperties=false]),
  dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data]
in
  dbo_DimDate

…and you don’t see those extra columns.

How much overhead does fetching relationship columns add? It depends on the type of source you’re using, how many relationships are defined and how many tables there are in your model (because the calls to get this information are not made in parallel). It’s also, as far as I know, impossible to measure the overhead from any public telemetry such as a Profiler trace or to deduce it by looking at the calls made on the database side. The overhead only happens when Power BI opens a connection to a data source and the result is cached afterwards, so it will only be encountered occasionally and not for every query that is run against your data source. I can say that the overhead can be quite significant in some cases though and can be made worse by other factors such as a lack of available connections or network/gateway issues. Since I have never seen anyone actually use these relationship columns in a DirectQuery model – they are quite handy in Power Query in general though – you should always turn them off when using DirectQuery mode.

[Thanks to Curt Hagenlocher for the information in this post]