To group by a column and filter rows with the maximum value in PySpark, you can use a combination of the `groupBy` and `agg` methods, followed by joining the original DataFrame to filter the rows with the maximum values. Here’s a detailed step-by-step explanation with code snippets:
Step-by-Step Guide
1. Sample DataFrame Creation
First, let’s create a sample DataFrame to work with.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, max
# Initialize SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
# Sample DataFrame
data = [
("Alice", "Math", 85),
("Alice", "Science", 92),
("Bob", "Math", 82),
("Bob", "Science", 90),
("Charlie", "Math", 88),
("Charlie", "Science", 85)
]
columns = ["Name", "Subject", "Score"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+-------+-------+-----+
| Name|Subject|Score|
+-------+-------+-----+
| Alice| Math| 85|
| Alice|Science| 92|
| Bob| Math| 82|
| Bob|Science| 90|
|Charlie| Math| 88|
|Charlie|Science| 85|
+-------+-------+-----+
2. Apply Group By and Aggregation
Next, we group the DataFrame by the “Name” column and find the maximum score for each person.
# GroupBy "Name" and find the max "Score"
max_score_df = df.groupBy("Name").agg(max("Score").alias("MaxScore"))
max_score_df.show()
Output:
+-------+--------+
| Name|MaxScore|
+-------+--------+
| Alice| 92|
| Bob| 90|
|Charlie| 88|
+-------+--------+
3. Join with Original DataFrame to Filter Rows
Finally, join the original DataFrame with the aggregated DataFrame on the “Name” column and filter rows where the score matches the maximum score.
# Join with the original DataFrame and filter rows
result_df = df.join(max_score_df, on="Name").filter(df["Score"] == max_score_df["MaxScore"])
result_df.show()
Output:
+-------+-------+-----+--------+
| Name|Subject|Score|MaxScore|
+-------+-------+-----+--------+
| Alice|Science| 92| 92|
| Bob|Science| 90| 90|
|Charlie| Math| 88| 88|
+-------+-------+-----+--------+
In this final result, we get the rows for each person where the score is the maximum score.
Summary
To summarize, the process involves:
1. Creating a sample DataFrame.
2. Using the `groupBy` and `agg` methods to find the maximum value for each group.
3. Joining the result with the original DataFrame and filtering rows where the values match the maximum values.