How to get the TopN rows using Python in Fabric Notebooks

When working with data, there are sometimes weird and wonderful requirements that must be met in order to get to the desired solution.

In today’s blog post I cover a situation where I wanted to get the single row with the highest duration.

Here is how I did it.

The example below uses a table that contains the start and end times for data being processed in a Power BI semantic model.

I wanted to get only the row with the highest duration, which has a Duration value of 18500.

To do this, I used the following PySpark code.

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session for an app called "OrderByDf"
spark_app = SparkSession.builder.appName('OrderByDf').getOrCreate()

# Create the Spark dataframe (df is assumed to already hold the source data)
df_dataframe = spark_app.createDataFrame(df)

# Order by the column in descending order: df_dataframe.ColumnName.desc()
# Select the top 1 value
# If I wanted the top 10, I would change the limit to limit(10)
df = df_dataframe.orderBy(df_dataframe.Duration.desc()).limit(1)

# Display the value
df.show()

I then ran the code in my notebook and got back the single row with the highest duration.

Summary

In this blog post I have shown you how I got the Top 1 row from my dataframe based on my requirements.

I do hope you found this useful. Any comments or suggestions are most welcome, both for this and for any other challenges you are facing when using PySpark notebooks.