The scenario here is that quite often there is a requirement to keep data only from a specific start date, or to keep data for the last N years (where a year is assumed to start on the first day of January).

Currently in Power BI, using the default incremental refresh settings, this is not possible. Typically, you must keep more data than is required.

It is best illustrated by using a working example.

The requirement is to always keep the last 2 years plus the current year's worth of data, but it must be for my fiscal year, which in Australia starts on 01 July every year.

This means that if today's date is 2024-07-09, I need to keep data from 2022-07-01 onwards.

In this blog post I am going to show you how to do this using Semantic-Link-Labs.

To do this in Power BI Desktop, the incremental refresh policy could be configured to keep rows in the last 24 months.

This is what happens if the incremental refresh policy is left at 24 months:

  1. The first month it keeps all the data as expected.
  2. The next month it drops off the data for July 2022, which I still need.
  3. The month after that it drops off August 2022; because of number 2, July 2022 has already been dropped.
  4. This keeps on happening each month, so by the time I get to the end of my fiscal year, I am missing data from July 2022 to May 2023.
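To see the gap concretely, here is a small sketch in plain Python (dates hard-coded for illustration; this mirrors my understanding of the rolling window as the complete months before the current period) comparing where a fixed 24-month window starts against the required fiscal start date of 2022-07-01:

```python
from datetime import date

def window_start(run_date: date, rolling_months: int) -> date:
    # First day of the month `rolling_months` complete months before the run date
    total = run_date.year * 12 + (run_date.month - 1) - rolling_months
    return date(total // 12, total % 12 + 1, 1)

required_start = date(2022, 7, 1)  # start of the fiscal year two years back

# A fixed 24-month window creeps past the required start date month by month
for run_date in [date(2024, 7, 9), date(2024, 8, 9), date(2025, 6, 9)]:
    ws = window_start(run_date, 24)
    print(run_date, "keeps from", ws, "- missing required data?", ws > required_start)
```

In July 2024 the window still starts at 2022-07-01, but every month after that it starts one month later, while the required start date stays fixed until the next fiscal year.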

The workaround is to keep data for additional months, which means changing my Incremental Refresh policy to keep data for the past 36 months.


While this does work, it means that I am storing more data than required. This has several potential implications, such as larger model sizes, potentially slower performance, and people getting access to data that they should not see.

The ideal solution is to update the Incremental Refresh policy each month, increasing the number of months as time moves forward while still adhering to the fiscal start date.

This is so that when the new fiscal year starts, the policy drops back to 24 months and then builds back up month by month over the course of the fiscal year.
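The month count can be derived from the fiscal start date. Here is a minimal sketch in plain Python (the July fiscal year start and the "last 2 years plus current year" rule come from the requirement above; the function name is mine):

```python
from datetime import date

FISCAL_START_MONTH = 7  # fiscal year starts 01 July
YEARS_TO_KEEP = 2       # last 2 fiscal years plus the current one

def fiscal_rolling_months(today: date) -> int:
    # Find the start of the current fiscal year, then go back YEARS_TO_KEEP years
    fy_year = today.year if today.month >= FISCAL_START_MONTH else today.year - 1
    keep_from = date(fy_year - YEARS_TO_KEEP, FISCAL_START_MONTH, 1)
    # Whole-month difference, matching SQL DATEDIFF(MONTH, keep_from, today)
    return (today.year - keep_from.year) * 12 + (today.month - keep_from.month)

print(fiscal_rolling_months(date(2024, 7, 9)))   # 24 at the start of a fiscal year
print(fiscal_rolling_months(date(2025, 6, 15)))  # grows as the fiscal year progresses
print(fiscal_rolling_months(date(2025, 7, 1)))   # resets to 24 in the new fiscal year
```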

How does it work

When the Incremental Refresh policy gets updated using the code below, all that it is doing is updating the schema or metadata of the semantic model.

Then the next time the refresh happens, the refresh looks at the schema, determines what needs to be done, and completes the refresh.

In my working example, when the rolling window changes from 24 to 25 months, the next refresh of the semantic model is smart enough to know that partitions already exist for those months and leaves them as is. Nothing changes and nothing is reprocessed.

NOTE: I would highly recommend that you schedule this process to run just after midnight UTC! This is because all the Power BI servers are set to UTC, and if you run it before midnight it could potentially drop off a month.

I typically schedule it for 12:10 AM UTC.
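The month boundary is why the timing matters: the same whole-month calculation run just before versus just after midnight UTC at the turn of a month returns different rolling windows. A quick illustration (dates hard-coded, using the same month-difference arithmetic as the SQL later in this post):

```python
from datetime import date

def months_between(start: date, today: date) -> int:
    # Whole-month difference, matching SQL DATEDIFF(MONTH, start, today)
    return (today.year - start.year) * 12 + (today.month - start.month)

start = date(2022, 7, 1)
print(months_between(start, date(2024, 7, 31)))  # 24 - still 31 July in UTC
print(months_between(start, date(2024, 8, 1)))   # 25 - just after midnight UTC
```

If the job ran at 11:50 PM UTC on 31 July, the policy would be set one month too small for the refreshes that follow in August.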

Using Semantic-Link-Labs to build the solution

The good news is that Michael Kovalsky from elegantbi.com has created Semantic-Link-Labs, which allows this to be achieved using a notebook.

I first installed Semantic-Link-Labs:

# Install Semantic-Link-Labs
%pip install semantic-link-labs

Below is the code that I used to connect to my semantic model, which has the Incremental Refresh policy applied.

# Connect to my Semantic Model
import sempy_labs
from sempy_labs.tom import connect_semantic_model

dataset = 'Incremental Refresh Point in Time' # Enter dataset name
workspace = 'Power BI - DRLS'

with connect_semantic_model(dataset=dataset, readonly=True, workspace=workspace) as tom:
    for t in tom.model.Tables:
        print(t.Name)
     
I then ran the following code to see what my current Incremental Refresh policy was set to.

# Show Current Incremental Refresh Policy
with connect_semantic_model(dataset=dataset, readonly=False, workspace=workspace) as tom:
    tom.show_incremental_refresh_policy(table_name = 'Fact Sales')

And here is the result


Now the next piece of code is where I am defining my start date as '2022-07-01'. I could also add more detailed logic here to derive the date dynamically, always going back to the start of the fiscal year for the last 2 years plus the current year.

For this example, it is easier for me to just hard-code the date.

Also, to be honest, it took me a bit of time to get the result from the SQL query into a variable holding the single value. The good news is that you don't have to figure it out, you can just copy my code 😊

# Calculate the Rolling Period months based on current date
import pyspark.pandas as ps

# SQL query to count the whole months from the start date to today
query = "SELECT DATEDIFF(MONTH, '2022-07-01', CURRENT_DATE) AS RollingMonths"

# Execute the query and return the result as a Pandas-on-Spark DataFrame
result_df = ps.sql(query)

# Get the result from the single column called "RollingMonths" into a variable
single_value = result_df['RollingMonths'].values[0]

print(single_value)   

The result returned was 24.


The final step was to take this value and use it to update my Incremental Refresh policy, using the code below. (Once again, it took me a while to figure out how to pass single_value into update_incremental_refresh_policy 😉)

# Apply new Incremental Refresh Policy

with connect_semantic_model(dataset=dataset, readonly=False, workspace=workspace) as tom:
    # Convert the single value from the SQL query into the rolling window size
    Rolling_Window = int(single_value)
    tom.update_incremental_refresh_policy(
        table_name='Fact Sales',
        incremental_granularity='Day',
        incremental_periods=4,
        rolling_window_granularity='Month',
        rolling_window_periods=Rolling_Window,
        only_refresh_complete_days=None,
        detect_data_changes_column=None,
    )

Once the above code ran, I could see that it had updated the Incremental Refresh policy.


Now, the next time my semantic model gets refreshed, either via the schedule or the API, it will apply the updated Incremental Refresh policy.

Summary

In this blog post I have shown you how to keep only the required data in your semantic model, and how to automate the process so that you store just the data you need.

Thanks for reading, comments and suggestions are most welcome.

You can get a copy of my notebook here: Blog – Update IR Policy