Using Power BI DirectLake in Microsoft Fabric
I am super excited to release this blog post and I cannot wait to start using Microsoft Fabric and all the different components.
The goal of my blog post is to look at how I could use Microsoft Fabric from a Power BI developers point of view.
Microsoft Fabric is a new all in one platform which includes Power BI. It is a single area where you can now complete all your data requirements without having the overhead of having to configure and manage multiple components. It is also possible to use the different components with a seamless experience.
As you can see below this is how Power BI Direct Lake works where it does not need to import the data into Power BI but can read the data directly from the OneLake.
In this blog post below, I am going to show you how I got this working.
My approach is coming from a Power BI user where I am familiar with using the no code or low code approach to getting things done.
In my example below there is data that is being inserted into an Azure SQL database every minute.
I am then going to show you how I use the new features in Fabric to get the entire process flow working.
I will be using the following features:
-
Lakehouse
- This is where all the data will be stored
-
Dataflow Gen2
- This is the new version of dataflows which uses Power Query Online.
-
Direct Lake dataset
- This is the Power BI dataset which will be getting the data directly from the Lakehouse without having to import the data.
-
Power BI reports
- This is where I create my Power BI reports
-
Data Pipelines
- The data pipeline is what I use to refresh or load my data.
- Power Automate (Outside of Fabric but still used to refresh the Power BI dataset)
Creating the Lakehouse
The first step in the process is for me to create a new Lakehouse.
The Lakehouse is where all the files and tables will be stored.
When I think about the Lakehouse is I think of files as the source files just being loaded in their raw form from their source.
While the tables in Lakehouse are the parquet files that have been processed via Dataflow Gen2 or Spark Notebooks.
I navigate to my App Workspace which is not configured as a Microsoft Fabric App Workspace.
Then in the bottom left corner I click on the Power BI Icon and select Data Engineering
Then at the top under new I clicked on Lakehouse.
I then put in my name of LH_FourMoo and clicked Create.
Once created I could see my Lakehouse as shown below.
I now had created my Lakehouse
Creating my Dataflow Gen2
The next step was for me to create my dataflow Gen2 which is going to get data from my Azure SQL Server database and load the data directly into the Lakehouse table.
I clicked on New Dataflow Gen2.
You will notice that this looks very similar to Dataflows.
I then clicked on Import from SQL Server.
I then put in all the required credentials for the server, database and clicked on Sign In to authenticate. And once signed in I clicked on Next.
I then searched for my table “tb_TriedentData”, selected the table and clicked on Create.
I then added a new column called “Blog Post” to the Dataflow Gen2 as shown below.
Before I continue when loading data into the Lakehouse all columns need to have a data type defined.
If they do not, like it currently stands for my new column called “Blog Post” if I tried to load the data into the Lakehouse I would get the following shown below, where the data type of ANY cannot load the data into the Lakehouse.
I went to my column called “Blog Post” and set the data type to Text as shown below.
Another thing to note is that tables going to the Lakehouse cannot have any spaces, so I also needed to rename the column to Blog_Post. If it had a space I would the same screen as previously shown where the column would not be part of the table loaded into the Lakehouse.
The next step I had to configure is to add in a data destination, which is where I want the data to go to.
In my example I want to put this data into my Lakehouse called “LH_FourMoo”, to do that in the ribbon I clicked on Home and then Add Data destination and selected Lakehouse as shown below.
It should then detect that it is going to be a Lakehouse connection as shown below. I then clicked Next.
Next, I need to select the location of the Lakehouse where I want to create my new Lakehouse table. As shown below, I selected the Lakehouse I created earlier.
I also selected to create a new Table and gave it the table name of “tb_TridentData”.
I then clicked on Next.
On the screen “Choose destination settings” I then chose the update method to be replace and in the column mapping I could see that all my columns would be successfully mapped.
I then clicked on Save Settings, which then saves all my Data destination details.
I could now see the data destination details on the bottom right-hand side of my Dataflow Gen2.
I then clicked on Publish to publish my Dataflow Gen2
Currently by default what it does is it saved the dataflow with a default name as shown below.
Also, once I clicked on publish I need to wait for the publishing of the dataflow to complete.
Please NOTE that this is the initial publishing of the dataflow no data has been inserted.
To rename the dataflow I hovered over to select the three dots and then selected “Properties”
I then renamed it to “DF – Azure SQL Data” and clicked on Save.
I then clicked on the Refresh Now button, which will now load the data into my Lakehouse table.
I could then see that the Dataflow Gen2 has completed successfully.
To confirm this I then went into my Lakehouse “LH_FourMoo”
I could then see my table was in the Lakehuse under Tables as shown below.
I can also confirm that it is in the required Delta/Parquet format which is used to create tables in the lakehouse (notice this little triangle in the table icon + _delta_log folder in the file hierarchy behind this table). If it was just Parquet, we couldn’t benefit from Direct Lake mode. The Delta/Format file format for the DirectLake by Clicking on the three dots next to “tb_Trident” and selecting “View Table files”
I could then see all the files for my table
Another thing to note, there is an awesome new feature where in the left hand side I can see my App Workspace and below that I also can see my Lakehouse that I have got open.
If I had to then open the LH_FourMoo SQL Endpoint I would then see that also open in the left-hand side.
I now have completed creating my Dataflow Gen2, publishing it and loading the data into the Lakehouse in the required Parquet format.
Creating a DirectLake Power BI Dataset
Now this is the most amazing part where I am going to create my DirectLake dataset, where I will not have to import the data into Power BI, but rather simply create my dataset directly from the tables in the Lakehouse.
To do this I first need to go into my Lakehouse “LH_FourMoo”
Once I was in the Lakehouse I then clicked on New Power BI Dataset
I then selected my table “tb_TridentData” and clicked Continue
I could then see my dataset being created.
Once it was created, I could then see the dataset
The first thing I wanted to do was to rename it from the default dataset name of “L_FourMoo (1)” on the top left.
I clicked on the name and updated it as shown below.
To save the new name I clicked on the name at the top again.
Creating my Power BI Report from my dataset
The next step was for me to create a Power BI Report.
On my dataset, I clicked on Create a report and from the drop down selected “Start from scratch”
Because this dataset is not complex I then created a single visuals with the details. In doing so this will show me if the data is being updated once I string it all together. Below is what the Power BI report looks like.
I then saved my report as shown below.
Creating a data pipeline to refresh the dataflow and Power BI Dataset
The final piece was for me to create a data pipeline which would refresh the data of my dataflow gen2 and then also refresh the Power BI dataset.
Whilst I am aware that the Power BI Dataset is using the DirectLake I still need to refresh the dataset, which will then do a metadata refresh allowing me to view the updated data directly from the parquet files.
Now before completing my data pipeline, I had already created a Power Automate flow which will refresh the Power BI dataset which will begin with “When a HTTP trigger is received” and then refreshes the Power BI Dataset.
Below is an overview of what I created using Power Automate, I do not go into details of how to complete this as I feel it is out of scope of the current blog post.
On the bottom left-hand side, I made sure that I had selected “Data Engineering”
I then clicked on Data pipeline.
This then brought up the new pipeline window, I put in the name of “DP – Azure SQL” and clicked on Create.
I then clicked on Add pipeline activity.
I then selected Dataflow.
In the General Tab I then renamed it from “Dataflow 1” to “DF – Azure SQL”, and left all the other defaults
I then clicked on Settings, clicked on the drop down and selected “DF – Azure SQL Data” as shown below.
I then clicked on Activities in the ribbon and selected “Web”
Then in the General settings I gave it the name of “PA – Refresh URL” and left the other defaults.
I then clicked on Settings and put in the HTTPS URL from my Power Automate flow I had shown previously, selected the Method to “GET” and left the defaults.
My final step was to drag the “On Success” from the Dataflow to the Web activity, which means that I first want to refresh the dataflow (which will bring in the updated data) and then refresh the Power BI dataset (via the Power Automate flow)
I then clicked on Save to Save my Pipeline Activity.
The final step was to then run the pipeline to make sure it all works as expected, I clicked on Run.
I could then see in the Output the details of my pipeline while running.
And then once it had successfully completed I could then confirm it in the output window as shown below.
When I then went to my Power BI report, I could then see that the updated report with the details.
Summary
In this blog post I have demonstrated step by step on how to use Microsoft Fabric allowing you to create a Direct Lake for Power BI Datasets and Power BI Reports.
I am really excited about Microsoft Fabric and what it has to offer.
Please let me know if there is any other content that you would me to blog about and I will see what I can do.
[…] Gilbert Quevauvilliers shows us how to enable Direct Lake mode […]
My understanding is that with Direct Lake mode, you do NOT have to explicitly refresh the Power BI dataset. It always shows the data in the lakehouse?
Hi Donald,
Yes that is correct, as far as I know it should do that, in my preview I had to refresh it.
Hi Donald / Gilbert,
I was also intrigued by the Power BI Refresh step. I am wondering if it is still necessary or at least possible. Here is what I was thinking while considering Glibert example.
When the DeltaLake table is updated, a new GUID folder is created. In Gilbert screenshot there are two. The empty table and the first run table. Could the Power BI refresh be required to know which delta table to use. In other word, Power BI refresh find the last folder GUID, and keeps using this folder until a new refresh tells Power BI to use the latest delta lake version.
Thanks for the comment Daryl, I think that is exactly what is happening!
In the same time, the Date data type can´t be loaded in the lakehouse only Date Time type. do you known why?
Hi Frank,
This is because for parquet files they only have the data type of datetime.
Excellent article Gilbert.
I am investigating a use-case to achieve “real-time” reporting in Power BI. Whereby when the source transactional db is updated it triggers “something” to pull the latest data into a Direct Lake dataset. What I am trying to establish is how best to achieve that. We don’t yet have a DW and I am the only data person. Currently I use views off the transactional system (Azure SQL Db) to pull into Power BI dataset. Not ideal to keep doing this long term so I was hoping with Fabric I could start to create a more robust, scaleable solution to pull data out either into LH or DW. I am not entirely sure if I need Data Factory copy OR a Dataflow as per your example above, and then not sure how to auto-detect change in source to trigger the refresh rather than on a schedule basis. Any suggestions welcome 🙂
Hi Bee,
What you could do is as I did, create a dataflow which will add the data to the lakehouse, after which you could create your DirectLake Power BI report.
Next you can then create a data factory pipeline which when run will refresh the dataflow.
As part of the dataflow you can create a webhook, which you could call from your transactional db source (Possibly using SSIS)
Let me know if you need any more assistance.
Thanks
Gilbert