How to get the TopN rows using Python in Fabric Notebooks

When working with data, there are sometimes weird and wonderful requirements that must be met in order to get to the desired solution.

In today’s blog post I had a situation where I wanted to get the single row with the highest duration.

Below is how I did it.

My example uses a table that has the start and end times for data being processed in a Power BI semantic model.

I wanted to get only the row with the highest duration, which has a Duration value of 18500.


To do this I used the following PySpark code.

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session for an app called "OrderByDf"
spark_app = SparkSession.builder.appName('OrderByDf').getOrCreate()

# Create the Spark dataframe from "df", the data already loaded earlier in the notebook
df_dataframe = spark_app.createDataFrame(df)

# Order by the column in descending order: "df_dataframe.ColumnName.desc()"
# Select the Top 1 row
# If I wanted the top 10, I would change the limit to "limit(10)"
# The result gets its own name so that "df" is not overwritten if the cell is re-run
df_top = df_dataframe.orderBy(df_dataframe.Duration.desc()).limit(1)

# Display the result
df_top.show()

I then ran the code in my notebook and got back the single row with the highest duration.

Summary

In this blog post I have shown you how I got the Top 1 row from my dataframe based on my requirements.

I do hope you found this useful, and any comments or suggestions are most welcome for this and any other challenges you are facing when using PySpark notebooks.
