How to Use Aggregate Functions on Multiple Columns in Spark SQL?

To use aggregate functions on multiple columns in Spark SQL, you can leverage the `select` method in DataFrames along with various built-in aggregate functions like `count`, `sum`, `avg`, `min`, and `max`. You can use these functions to perform aggregations on multiple columns simultaneously. Below, I’ll provide an example using PySpark, a popular API for Apache Spark in Python.

Contents hide

1 Using Aggregate Functions on Multiple Columns in PySpark

1.1 Explanation

2 About Editorial Team

3 You Might Also Like:

Using Aggregate Functions on Multiple Columns in PySpark

First, you need to create a Spark session and a sample DataFrame:


from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, avg, max, min

# Create Spark session
spark = SparkSession.builder.appName("AggregateFunctionsExample").getOrCreate()

# Create a sample DataFrame
data = [
    ("Alice", "Sales", 3000),
    ("Bob", "Sales", 4000),
    ("Charlie", "Sales", 5000),
    ("Alice", "HR", 3500),
    ("Bob", "HR", 3500),
    ("Charlie", "HR", 3500)
]

schema = ["Name", "Department", "Salary"]

df = spark.createDataFrame(data, schema)

# Show the DataFrame
df.show()


+-------+----------+------+
|   Name|Department|Salary|
+-------+----------+------+
|  Alice|     Sales|  3000|
|    Bob|     Sales|  4000|
|Charlie|     Sales|  5000|
|  Alice|        HR|  3500|
|    Bob|        HR|  3500|
|Charlie|        HR|  3500|
+-------+----------+------+

Now, let’s perform aggregate functions on the columns:


# Perform aggregation on multiple columns
result_df = df.groupBy("Department").agg(
    sum("Salary").alias("Total_Salary"),
    avg("Salary").alias("Average_Salary"),
    max("Salary").alias("Max_Salary"),
    min("Salary").alias("Min_Salary")
)

# Show the result
result_df.show()


+----------+-----------+--------------+----------+----------+
|Department|Total_Salary|Average_Salary|Max_Salary|Min_Salary|
+----------+-----------+--------------+----------+----------+
|     Sales|       12000|        4000.0|      5000|      3000|
|        HR|       10500|        3500.0|      3500|      3500|
+----------+-----------+--------------+----------+----------+

Explanation

In the example above:

We grouped the data by the “Department” column using the `groupBy` method.
We used the `agg` method to apply multiple aggregate functions to the “Salary” column.
The `sum`, `avg`, `max`, and `min` functions calculate the respective aggregate values and return the result as a new DataFrame.
We used the `alias` method to rename the resulting columns for better readability.

This approach can be easily adapted to other aggregate functions and columns depending on your use case.

About Editorial Team

Our Editorial Team is made up of tech enthusiasts who are highly skilled in Apache Spark, PySpark, and Machine Learning. They are also proficient in Python, Pandas, R, Hive, PostgreSQL, Snowflake, and Databricks. They aren't just experts; they are passionate teachers. They are dedicated to making complex data concepts easy to understand through engaging and simple tutorials with examples.

Using Aggregate Functions on Multiple Columns in PySpark

Explanation

About Editorial Team

You Might Also Like:

Leave a Comment Cancel Reply