I have always wondered what the partialBatch option for the commitMode parameter in the Enhanced Refresh API does exactly. There is some documentation here and here but I was curious to find out more as part of the research I’m doing for my ongoing series on Power BI refresh memory errors, in case it was useful for reducing overall memory usage (spoiler: it may be). In this post I’ll share what I found out after running some tests.
For my testing I created a semantic model with four tables called A, B, C and D. Each semantic model table used an M expression similar to the following as its source:
Function.InvokeAfter(
() => #table(type table [MyCol = text], {{"A"}}),
#duration(0, 0, 0, 10)
)
This expression – the expression for table A – returns a table with one column and one row after a delay of 10 seconds; the expressions for tables B, C and D were almost identical but had delays of 20, 30 and 40 seconds respectively. These delays ensured that when the tables refreshed they always finished in a set order if they were refreshed in parallel. There were no relationships or measures in the model.
I published this model and then created a notebook to refresh it using Semantic Link’s refresh_dataset method (hat tip to Phil Seamark, whose code I stole) which uses the Enhanced Refresh API behind the scenes:
import sempy.fabric as fabric
WorkspaceName = "CW partialBatch Tests"
SemanticModelName = "partialBatchTest"
# run the refresh
request_status_id = fabric.refresh_dataset(dataset=SemanticModelName, workspace=WorkspaceName, refresh_type="full", commit_mode="partialBatch", retry_count=1, max_parallelism=4)
print("Progress:", end="")
while True:
status = fabric.get_refresh_execution_details(SemanticModelName, request_status_id, WorkspaceName).status
if status == "Completed":
break
print("░", end="")
time.sleep(2)
print(": refresh complete")
Note that the commit_mode parameter is set to partialBatch, refresh_type is set to full and that, at this point, max_parallelism was set to 4, which is the number of tables in the model and which meant that all four tables refreshed in parallel.
I ran a Profiler trace while running this code to refresh the model and observed the following happening:
- First, a refresh of type ClearValues (which clears all data from a table) was run in a single transaction for all four tables. See here for more details about each type of refresh.
- Next a refresh of type DataOnly (which loads the data into a table but nothing more) was run in a single transaction for all four tables.
- Finally, a refresh of type Calculate (which builds things like calculated tables, calculated columns, hierarchies and relationships) was run in a single transaction for all four tables. After this, all four tables were fully refreshed.
There are two important things to note about this:
- By running a refresh of type ClearValues first, to clear all the data out of the model, the memory consumption of the model is reduced to almost nothing. The downside of this is that the model is no longer queryable until the next two refreshes complete.
- Running two separate refreshes of type DataOnly and Calculate, rather than a single refresh of type Full, also reduces peak memory usage although it will also probably be slower.
For my second test I changed the definition of table C so that there was a 50% chance it returned an error when refreshed if an M parameter called FailRefresh was set to “Y”:
if FailRefresh = "Y" and Number.Random() > 0.5 then
Function.InvokeAfter(
() =>
error Error.Record("Forced Refresh Failure"),
#duration(0, 0, 0, 30)
)
else
Function.InvokeAfter(
() => #table(type table [MyCol = text], {{"C"}}),
#duration(0, 0, 0, 30)
)
I then published the model again, changed my notebook code so that the retry_count parameter of refresh_dataset was set to 2 (see here for more detail about refresh retries), and ran the refresh until I got an instance where table C’s refresh failed the first time but succeeded on the retry. I observed the following:
- First, a refresh of type ClearValues was run in a single transaction for all four tables as before.
- Next, a refresh of type DataOnly was run in a single transaction for all four tables. Although the refreshes for tables A and B, which took 10 and 20 seconds, completed successfully the transaction failed because the refresh for table C returned an error after 30 seconds. The refresh for table D, which would have taken 40 seconds, did not complete.
- Next, because retry_count was set to 2, the refresh of type DataOnly was run again for all four tables. This succeeded. Note that it did not try to rerun the ClearValues refresh again and that tables A and B were refreshed all over again.
- Finally a refresh of type Calculate was run in a single transaction for all four tables, as before.
For my final test I kept everything the same except for setting the max_parallelism parameter to 1. I then refreshed until I once again found a case where table C’s refresh failed first time but succeeded on the retry. This time I observed the following:
- First, a refresh of type ClearValues was run in a single transaction for all four tables as before except with a MaxParallelism of 1.
- Next, a refresh of type DataOnly was run for just table A in its own transaction. It succeeded.
- Next, a refresh of type DataOnly was run for just table B in its own transaction. It succeeded.
- Next, a refresh of type DataOnly was run for just table C in its own transaction. It failed.
- Next, a refresh of type DataOnly was run for just table C in its own transaction. It succeeded, and this represents the start of the retry.
- Next, a refresh of type DataOnly was run for just table D in its own transaction. It succeeded.
- Finally a refresh of type Calculate was run in a single transaction for all four tables, as before except with a MaxParallelism of 1.
This is interesting because, in contrast with the previous test, when table C failed and the retry started, tables A and B did not get refreshed a second time.
What can we learn from this?
- The way the partialBatch commit mode splits up a refresh into three separate refreshes of type ClearValues, DataOnly and Calculate will reduce the overall peak memory consumption during a refresh, which is useful if your refresh is failing because you are using too much memory.
- It will result in slower refresh performance overall and the model will not be queryable for most of the refresh, however.
- If one of the tables in your model fails to refresh then reducing the amount of parallelism and setting a retry_count of more than 0 may mean that the tables that did refresh successfully first time around may not need to be refreshed again. Whether this results in a faster overall refresh though depends on whether not needing to refresh tables again counteracts the effect of less parallelism.
Overall I think using the partialBatch commit mode may be useful if you need to reduce memory usage during refresh or have very unreliable data sources. However if you need this type of fine grained control over your refreshes you’re probably better off writing more complex code (for example using a Fabric notebook as I did here) to control exactly what gets refreshed and how, rather than just making a single call to the Enhanced Refresh API using this option.
[Thanks to Jast Lu for answering my questions about this topic]