Microsoft Fabric – Comparing Dataflow Gen2 vs Notebook on Costs and usability
In this blog post I am going to compare Dataflow Gen2 vs Notebook in terms of how much each workload costs. I will also compare usability, as Dataflow Gen2 currently has a lot of built-in features which make it easier to use.
The goal of this blog post is to understand which one, in my opinion, is cheaper and easier to use. That will then be the focus of future blog posts covering what I have learned along the way, which will hopefully assist you too.
To compare the two workloads, I am going to use the same source file and perform the same transformations, which will produce the same result.
For my examples below I am using an F2 Capacity in Australia Southeast which is $0.42/hour. The pricing information can be found here: Microsoft Fabric – Pricing | Microsoft Azure
In both workloads, all that I’m going to do is to add a new column called date num, which will be the date represented in a number format.
As you can see below, this is how I did it in the Dataflow Gen2.
This is the expected output from the Dataflow Gen2
And this is how I did it when using the notebook to achieve the same result. Please note that I did not write the notebook code myself; I used the Data Wrangler feature to get the desired output. I blogged about this previously, which you can find the link to here:
This is the expected output from the Notebook code.
Comparing the Costs
One of the critical things to validate when using Fabric workloads is how much it costs.
I am still learning how to calculate the Fabric costs, so this is subject to change. And when this does, I will then update this blog post!
I use the Fabric metrics app to look at both the capacity units consumed and the duration, in order to calculate the costs of each of the workloads.
As I will explain below, you can see how I determined the cost for each workload, to decide which of the two is cheaper to use.
Below shows the CU(s) and Duration (s) from my two workloads.
I have then calculated the billing costs based on the billing example from the Microsoft documentation: Data warehouse billing and utilization reporting – Microsoft Fabric | Microsoft Learn
Dataflow Gen2 Cost
The duration for the Dataflow Gen2 was 97.18 seconds, so to calculate the cost I did the following.
Because I am using an F2 capacity, the price is $0.42/hour.
The cost calculation is then (97.18 x 0.42)/3600 = $0.0113
NOTE: The reason for dividing by 3600 is that there are 3,600 seconds in an hour. Because the pricing is hourly, I must divide by 3600 to get the cost per second.
Notebook Cost
The duration for the Notebook was 84.81 seconds, so to calculate the cost I did the following.
Because I am using an F2 capacity, the price is $0.42/hour.
The cost calculation is then (84.81 x 0.42)/3600 = $0.0098
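If it helps, here is the same arithmetic expressed as a small Python snippet. It is purely illustrative and simply repeats the duration-based calculation above using the F2 hourly rate:

# Illustrative only: repeats the duration-based calculation described above.
F2_RATE_PER_HOUR = 0.42      # F2 price in Australia Southeast ($/hour)
SECONDS_PER_HOUR = 3600

def duration_cost(duration_seconds):
    # cost = (duration in seconds x hourly rate) / seconds in an hour
    return duration_seconds * F2_RATE_PER_HOUR / SECONDS_PER_HOUR

print(f"Dataflow Gen2: ${duration_cost(97.18):.4f}")   # ~$0.0113
print(f"Notebook:      ${duration_cost(84.81):.4f}")   # ~$0.0099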
Bursting Costs
It is known that bursting can occur when using Fabric workloads, so it potentially needs to be factored in when calculating the costs of Fabric workloads.
Thanks to Tristan Malherbe for his assistance with how to calculate the bursting costs.
If I had to make up a bursting example where there was an additional 2,000 CU(s) of bursting, this is how I would calculate the additional bursting cost.
For an F2 capacity I would calculate the cost for 1 CU(s) to be $0.00005833
This was calculated as: per-minute capacity cost / (CUs in the capacity * 60 seconds)
Actual Calculation: $0.007 / (2 * 60) = $0.00005833
The final cost for the additional 2000 CU(s) would be: 2000 * $0.00005833 = $0.1167
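And here is the same made-up bursting example as a small snippet, so you can plug in your own numbers. The 2,000 CU(s) figure is the hypothetical burst from above:

# Illustrative only: the hypothetical bursting example above, in code.
F2_RATE_PER_HOUR = 0.42
RATE_PER_MINUTE = F2_RATE_PER_HOUR / 60        # $0.007 per minute
CAPACITY_CUS = 2                               # an F2 has 2 CUs

cost_per_cu_second = RATE_PER_MINUTE / (CAPACITY_CUS * 60)    # ~$0.00005833
extra_cu_seconds = 2000                                       # hypothetical burst

print(f"Cost per CU(s): ${cost_per_cu_second:.8f}")                       # ~$0.00005833
print(f"Bursting cost:  ${extra_cu_seconds * cost_per_cu_second:.4f}")    # ~$0.1167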
Cost Comparison
To compare the costs, I then work out the difference in cost between the Dataflow Gen2 and the Notebook.
This works out to the Dataflow Gen2 costing roughly 1.15 times as much as the notebook, or about 15% more. I understand the cost difference in my example is very small, but what I have seen on larger workloads is that this becomes quite significant.
Another thing that I must factor in is looking at how many CU(s) are consumed for each of the workloads.
When I compare this there is a significant difference between them: the Dataflow Gen2 consumes roughly 3.4 times the CU(s) of the notebook. This is certainly something to consider when looking at how many CU(s) you get to consume daily. For an F2 capacity there are 172,800 CU(s) per day to be consumed.
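To make the ratios above explicit, here is a tiny snippet that works them out. The CU(s) figures are the ones reported by the Fabric metrics app for these two runs:

# Illustrative only: the ratios behind the comparison above.
dataflow_cost, notebook_cost = 0.0113, 0.0098    # from the calculations above
dataflow_cus, notebook_cus = 1154, 339           # CU(s) from the metrics app

print(f"Cost ratio:  {dataflow_cost / notebook_cost:.2f}x")    # ~1.15x
print(f"CU(s) ratio: {dataflow_cus / notebook_cus:.2f}x")      # ~3.40x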
Here is a cost calculator if you want to work out the estimated costs.
Comparing usability
Another consideration is which one is easier to use when comparing Dataflow Gen2 and notebooks.
I personally think that this might vary depending on what changes need to happen with your data.
My example is simple, but I think it still demonstrates how this can be achieved with the different workloads.
Notebook usability
To create the new column, I used the Data Wrangler built into the notebook to make the change.
It is certainly more of a challenge to learn a new language such as Python compared to using something with a GUI, such as Dataflow Gen2.
With that said, what I have learnt is that there are a lot of working examples you can search for, and very often the code is flexible.
When trying to create the new column using a notebook, I first had to load the data into a data frame. I then used the Data Wrangler to create my new column with the “New column by example” feature.
This is the code that it created below; the great thing about this is I did not write one line of the actual code!
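For readers who cannot see the screenshot, below is a rough sketch of the kind of pandas code that Data Wrangler generates for a “New column by example” step. The column names and the exact number format are illustrative assumptions, not the exact generated output:

import pandas as pd

# Rough sketch of the style of code Data Wrangler generates (illustrative only).
def clean_data(df):
    # Derived column "date num" inferred from the example values:
    # the date represented as a number, e.g. 2023-07-15 -> 20230715
    df["date num"] = pd.to_datetime(df["Date"]).dt.strftime("%Y%m%d").astype(int)
    return df

# Hypothetical sample data, just to show the transformation
df = pd.DataFrame({"Date": ["2023-07-15", "2023-07-16"], "Sales": [100, 200]})
df = clean_data(df)
print(df)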
If you are interested in how this works you can read my blog post An easy way to transform/clean your data using a Notebook in Microsoft Fabric – FourMoo
This made it quite easy to create the column, load it to my data frame and then write the output to a Lakehouse table.
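Writing the result out is then only a line or two in the notebook. Here is a minimal sketch, assuming the built-in spark session of a Fabric notebook, the pandas data frame from the sketch above, and a hypothetical table name:

# Minimal sketch: convert the pandas data frame to Spark and save it as a
# Lakehouse table. "sales_with_datenum" is a hypothetical table name.
spark_df = spark.createDataFrame(df)
spark_df.write.mode("overwrite").format("delta").saveAsTable("sales_with_datenum")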
Dataflow Gen2 usability
People will be a lot more familiar with using Dataflow Gen2 because it is Power Query (as in Power BI Desktop), but on the web.
The Dataflow Gen2 features a GUI, which makes it a lot easier and potentially quicker to complete tasks.
In my working example, I had to use a custom column to get the data into the format that I wanted. It was particularly easy to get this done.
The only additional step I had to do was to enable the data destination, which enabled me to write the data to a Lakehouse table.
Usability Outcome
For me, when comparing which one is easier to use between a notebook and a Dataflow Gen2, my initial reaction is to go with the Dataflow Gen2 because it is something I am more familiar with.
But to be fair, there are a lot of code examples and a lot of functionality available when using a notebook. I have also found that when the data sets are larger, a notebook can run things significantly quicker. What I do enjoy with the notebook is that once I have got a piece of code, I can save it and reuse it multiple times when needed for other data sets.
For me to determine an outcome, I guess it would depend on how much data transformation I need to do, as well as how large the dataset I am working with is. What I would probably do is use a combination of both the Dataflow Gen2 and the notebook.
I could then get the best of both worlds; maybe more complex transformations would be done using a Dataflow Gen 2 and larger datasets would be loaded using notebooks.
With that being said, I am finding that Python is becoming easier the more I use it, and there is a lot of code out there which can easily be reused.
Summary
In this blog post I have explained the cost differences between Dataflow Gen2 and notebooks.
I also explained some differences in how to transform the data.
In my personal opinion I am going to be using the Notebooks much more in the future. I feel that is where I can learn a lot and share my knowledge with others.
I am certainly going to be focusing on more Fabric content as this is a new area for me and I do hope that it will also be valuable content for you the reader to learn about.
If you have any suggestions about potential future blog posts, please let me know.
Nice article. The cost calculations are a bit off. Don’t use the duration in the calculation. The column labeled “CU(s)” is the Consumption Unit Seconds, which is the measure of how much capacity the operation consumed.
Cost calc would be something like:
An F2 has 2 CUs and costs $0.42/hour in that region. So that’s $0.21 per CU hour. Or put another way, that $0.42 gives you a quota of 7,200 CU Sec each hour.
The notebook run consumed 339 CU Sec, or 0.094 CU Hour, so cost $0.02, or about 5% of the hourly quota.
The dataflow consumed 1154 CU Sec, or 0.32 CU Hour, so cost $0.07, or about 16% of the hourly quota.
Cheers,
David
Hi David,
Thanks for helping me understand how the cost calculation works. I will update my blog post and make people aware!
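For anyone else reading along, here is David's CU(s)-based calculation written out as a small snippet (illustrative only, using the figures from his comment):

# Illustrative sketch of the CU(s)-based calculation from David's comment.
F2_RATE_PER_HOUR = 0.42                        # F2 capacity price ($/hour)
F2_CUS = 2                                     # an F2 has 2 CUs
RATE_PER_CU_HOUR = F2_RATE_PER_HOUR / F2_CUS   # $0.21 per CU hour
QUOTA_CU_SECONDS_PER_HOUR = F2_CUS * 3600      # 7,200 CU(s) per hour

def cost_from_cu_seconds(cu_seconds):
    return (cu_seconds / 3600) * RATE_PER_CU_HOUR

for name, cu_s in [("Notebook", 339), ("Dataflow Gen2", 1154)]:
    pct = cu_s / QUOTA_CU_SECONDS_PER_HOUR * 100
    print(f"{name}: ${cost_from_cu_seconds(cu_s):.2f} ({pct:.0f}% of the hourly quota)")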
Why is notebook much faster/cheaper than dataflow gen2?
The quickest way to explain it is that the notebook consumes fewer resources.
I agree with David’s calculation. This is how I thought it would work.
Where I get lost is with bursting/smoothing. This would see these CU (Capacity Units?) spread across the next 24 hours. If you stop the F2 capacity, all the smoothed CU are charged immediately after completing these two tasks – you will be charged $0.09. Your workloads would have been completed in 170 seconds (Notebook: 339 CU(s) at 2 CU per second) and 577 seconds (Dataflow: 1,154 CU(s) at 2 CU per second), and with bursting they were reduced to 84 seconds and 97 seconds.
But this is then smoothed across 24 hours. And if the capacity is left on and unused, the actual charge would be $10.08, and around 171k CU(s) during that 24-hour period are wasted.
Hi Gilbert
For the extract and load feature, we also have the old copy task for SQL queries, tables, files and folders in Data Factory.
About E.L. (extract and load) on Microsoft Fabric:
1) Yes, notebooks, which carry PySpark, are powerful and robust for production, but we need Python skills.
2) I hope Dataflow Gen2 will be more robust in the future, like Airbyte for example. I would like Microsoft to improve API management (service principal, header, body, continuation token etc …) in Power Query, and also an interface for slowly changing dimension upserts (date start, date end, key).
3) How can copy be improved?
Hi Christophe,
Thanks for the comment and I can certainly understand why it would be easier to have all those things in Dataflow Gen2!
It might get there one day, but doing that in a notebook is very easy to complete!