How to Group by a Column and Filter Rows with Maximum Value in PySpark?

To group by a column and filter rows with the maximum value in PySpark, you can use a combination of the `groupBy` and `agg` methods, followed by joining the original DataFrame to filter the rows with the maximum values. Here’s a detailed step-by-step explanation with code snippets:

Step-by-Step Guide

1. Sample DataFrame Creation

First, let’s create a sample DataFrame to work with.


from pyspark.sql import SparkSession
from pyspark.sql.functions import col, max

# Initialize SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()

# Sample DataFrame
data = [
    ("Alice", "Math", 85),
    ("Alice", "Science", 92),
    ("Bob", "Math", 82),
    ("Bob", "Science", 90),
    ("Charlie", "Math", 88),
    ("Charlie", "Science", 85)
]

columns = ["Name", "Subject", "Score"]
df = spark.createDataFrame(data, columns)

df.show()

Output:


+-------+-------+-----+
|   Name|Subject|Score|
+-------+-------+-----+
|  Alice|   Math|   85|
|  Alice|Science|   92|
|    Bob|   Math|   82|
|    Bob|Science|   90|
|Charlie|   Math|   88|
|Charlie|Science|   85|
+-------+-------+-----+

2. Apply Group By and Aggregation

Next, we group the DataFrame by the “Name” column and find the maximum score for each person.


# GroupBy "Name" and find the max "Score"
max_score_df = df.groupBy("Name").agg(max("Score").alias("MaxScore"))

max_score_df.show()

Output:


+-------+--------+
|   Name|MaxScore|
+-------+--------+
|  Alice|      92|
|    Bob|      90|
|Charlie|      88|
+-------+--------+

3. Join with Original DataFrame to Filter Rows

Finally, join the original DataFrame with the aggregated DataFrame on the “Name” column and filter rows where the score matches the maximum score.


# Join with the original DataFrame and filter rows
result_df = df.join(max_score_df, on="Name").filter(df["Score"] == max_score_df["MaxScore"])

result_df.show()

Output:


+-------+-------+-----+--------+
|   Name|Subject|Score|MaxScore|
+-------+-------+-----+--------+
|  Alice|Science|   92|      92|
|    Bob|Science|   90|      90|
|Charlie|   Math|   88|      88|
+-------+-------+-----+--------+

In this final result, we get the rows for each person where the score is the maximum score.

Summary

To summarize, the process involves:
1. Creating a sample DataFrame.
2. Using the `groupBy` and `agg` methods to find the maximum value for each group.
3. Joining the result with the original DataFrame and filtering rows where the values match the maximum values.

About Editorial Team

Our Editorial Team is made up of tech enthusiasts deeply skilled in Apache Spark, PySpark, and Machine Learning, alongside proficiency in Pandas, R, Hive, PostgreSQL, Snowflake, and Databricks. They're not just experts; they're passionate educators, dedicated to demystifying complex data concepts through engaging and easy-to-understand tutorials.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top