How to get the TopN rows using Python in Fabric Notebooks
When working with data, there are sometimes weird and wonderful requirements that have to be met in order to get to the desired solution.
In my case, I wanted to get the single row with the highest duration.
This is how I did it.
My example table contains the start and end times for data being processed in a Power BI semantic model.
I wanted to get only the row with the highest duration, which has a Duration value of 18500.
To do this I used the following PySpark code.
    import pyspark
    from pyspark.sql import SparkSession

    # Create an app called "OrderByDf"
    spark_app = SparkSession.builder.appName('OrderByDf').getOrCreate()

    # Create the dataframe
    df_dataframe = spark_app.createDataFrame(df)

    # Order by column name: "df_dataframe.ColumnName.desc()"
    # Select the top 1 value
    # If I wanted the top 10, I would change the limit to "limit(10)"
    df = df_dataframe.orderBy(df_dataframe.Duration.desc()).limit(1)

    # Display value
    df.show()
I then ran the code in my notebook and got back the single row with the highest duration.
Summary
In this blog post I have shown you how I got the Top 1 row from my dataframe based on my requirements.
I do hope you found this useful, and any comments or suggestions are most welcome for this and any other challenges you are facing when using PySpark notebooks.