In this blog post I am going to show you an easy way to clean your data (which often means fixing data issues or mis-spelt values) using the new feature to launch Data Wrangler on DataFrames.

I had previously blogged about using Pandas data frames, but that approach required extra steps and detail. If you are interested in that blog post, you can find it here: Did you know that there is an easy way to shape your data in Fabric Notebooks using Data Wrangler?

In this blog post I am going to show you how I cleaned up the data in my location column.

Loading the CSV data into a data frame

The first step was to load the CSV files into a data frame in my PySpark notebook, as shown below.

df_staging = spark.read.format("csv").option("header", "true").load("Files/Delay_Data/*.csv")

# Check that df_staging is bringing back data.

df_staging.show(10)

Once I had run this, I could see the output of the data from my CSV files.

Getting my data frame into the Data Wrangler (In my mind I think of it like Power Query)

Next, to transform my data, I need to get my data frame into the Data Wrangler with the steps below.

  1. I clicked on Data.
  2. Then I clicked on “Transform DataFrame in Data Wrangler”
  3. Finally, I selected the name of my data frame “df_staging”

Transforming my data using Data Wrangler

The steps below are where I transformed my data for the Location column.

I expanded Operations and then selected “Find and replace”

I then went through the following steps to complete the find and replace.

  1. I clicked on the drop down and selected the column called “Location”
  2. For the Old value, I entered the value I am searching for to replace.
    1. In my example I am looking for “Coast”
  3. For the New value, I wanted to replace the value with nothing, so I pressed the space bar once.
  4. Once I was happy with the results I clicked on Apply.

I could then see the changes in the green column before I applied them.

I then clicked Apply (Number 4 from above).

If I wanted to see the code, I could view it in the code section shown below, where Data Wrangler is writing the code for me.

I then applied the code by clicking on “Add code to notebook” as shown below.

I could then see the code in the notebook as shown below.
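For readers following along without the screenshots, the generated code is roughly along the lines of the sketch below. This is a minimal sketch, assuming the step is the find and replace of “Coast” in the Location column; the function and data frame names here are just placeholders, and the exact code Data Wrangler writes for you may differ.

from pyspark.sql.functions import regexp_replace

# Minimal sketch (assumed, not the exact generated code):
# replace "Coast" with a single space in the Location column.
def clean_data(df):
    return df.withColumn("Location", regexp_replace("Location", "Coast", " "))

df_staging_clean = clean_data(df_staging)
df_staging_clean.show(10)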

The final step was to take my transformed data and insert it into my Lakehouse table as shown below.
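For reference, here is a minimal sketch of how that insert can be done in PySpark, assuming the cleaned data frame is called df_staging_clean (from the sketch above) and the target is the “Delays” table mentioned below; your table name and write mode may differ.

# Minimal sketch (assumed names): write the cleaned data frame to a Lakehouse Delta table.
df_staging_clean.write.format("delta").mode("overwrite").saveAsTable("Delays")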

I could then go into my Lakehouse and view the “Delays” table using the SQL Endpoint.
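If you want a quick sanity check without leaving the notebook first, something like the line below also works, assuming the table was saved as “Delays”.

# Optional check from the notebook before switching to the SQL Endpoint.
spark.sql("SELECT * FROM Delays LIMIT 10").show()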

Summary

I hope that you found this useful and that you can now easily transform your data in Fabric using Notebooks and Data Wrangler.

Any comments or suggestions are most welcome!