Fabric Notebooks – Looping through data using PySpark
Continuing with my blog series on what I'm learning with notebooks and PySpark, today I'm going to explain how I found a way to loop through data in a notebook.
In this example, I'm going to show you how I loop through a range of dates, which can then be used in a subsequent query to extract data by passing each date into a DAX query.
Please note that I am using the option to Open in VS Code (Web) (PREVIEW), so that I can work with VS Code in the browser.
The first thing I did in my notebook was to load SemPy and some configuration settings.
Getting the dataset to loop through
As explained in my working example, I am going to loop through a set of dates which I am getting from a Lakehouse table.
The way I am getting this data is to use PySpark SQL, as shown below.
The reason my query is in this format is that my DirectLake semantic model only has dates for the periods shown below.
Also, when passing dates through to DAX, each date needs to be split into Year, Month and Day.
When I run the above cell, I can see my output as shown below (this is to ensure that I am getting the expected output).
Code to loop through the data
Below is the final step, where I am now going to show you how I looped through the data from the dataframe df_Dates_Query.
I have put comments in the code which will explain the different steps I completed.
One thing I want to call out is that when looping through all the rows in the dataset, the code inside the loop must be indented (TAB) so that it runs once for each row in the dataset.
##### Looping through the Dates #####

# SemPy was loaded earlier in the notebook; the import is repeated here so the cell is self-contained.
import sempy.fabric as fabric

# Workspace and Semantic Model Name
WorkspaceName = "Fabric - FourMoo"
SemanticModelName = "Near Realtime Semantic Model"

# Get the dataframe I want to loop through.
# I used the method "collect()" to bring the rows back to the driver so I can loop over them.
# My variable is called "data_collect"
# I am using my previous dataframe called "df_Dates_Query"
data_collect = df_Dates_Query.collect()

# If needed I can see the data I am going to loop through for some validation.
# display(data_collect)

# Looping through each row of the dataframe
for row in data_collect:
    # NOTE: Everything inside the loop has to be indented (TAB) so that it runs for each row in the dataset.
    # This is where I am assigning a variable name for each column in the current row.
    # str() guards against the values coming through as numbers instead of strings.
    var_Date = row["Date"]
    var_Year = str(row["YearValue"])
    var_Month = str(row["MonthValue"])
    var_Day = str(row["DayValue"])

    # Below is my DAX Query String
    querytext = '''\
    // DAX Query
    EVALUATE
    SUMMARIZECOLUMNS(
        SQL_Mirror_Sales[InvoiceDateKey],
        KEEPFILTERS(
            TREATAS(
                {DATE(''' + var_Year + ''', ''' + var_Month + ''', ''' + var_Day + ''')},
                SQL_Mirror_Sales[InvoiceDateKey]
            )
        ),
        "Sales", [Sales (Mirror)]
    )
    '''

    # Running the DAX Query against the semantic model
    df_DAX_QueryResult = fabric.evaluate_dax(
        workspace=WorkspaceName,
        dataset=SemanticModelName,
        dax_string=querytext
    )

    # Get the output of the DAX Query
    display(df_DAX_QueryResult)
Here is the output from my DAX query, where you can see I am looping through each of the dates highlighted below.
If I did not have the indentation (TAB) and tried to run the code, I would get the error shown below.
Summary
In this blog post I have shown you how I loop through data in a PySpark notebook.
Thanks for reading, comments and suggestions are most welcome 😊