How to Use Aggregate Functions on Multiple Columns in Spark SQL?

To use aggregate functions on multiple columns in Spark SQL, group the DataFrame with `groupBy` and pass several aggregate expressions to the `agg` method, using built-in functions like `count`, `sum`, `avg`, `min`, and `max`. (For ungrouped aggregations over a whole DataFrame, the same expressions can also be passed to `select`.) Below, I’ll provide an example using PySpark, the Python API for Apache Spark.

Using Aggregate Functions on Multiple Columns in PySpark

First, you need to create a Spark session and a sample DataFrame:


from pyspark.sql import SparkSession
# Note: these imports shadow Python's built-in sum, max, and min in this script
from pyspark.sql.functions import sum, avg, max, min

# Create Spark session
spark = SparkSession.builder.appName("AggregateFunctionsExample").getOrCreate()

# Create a sample DataFrame
data = [
    ("Alice", "Sales", 3000),
    ("Bob", "Sales", 4000),
    ("Charlie", "Sales", 5000),
    ("Alice", "HR", 3500),
    ("Bob", "HR", 3500),
    ("Charlie", "HR", 3500)
]

schema = ["Name", "Department", "Salary"]

df = spark.createDataFrame(data, schema)

# Show the DataFrame
df.show()

+-------+----------+------+
|   Name|Department|Salary|
+-------+----------+------+
|  Alice|     Sales|  3000|
|    Bob|     Sales|  4000|
|Charlie|     Sales|  5000|
|  Alice|        HR|  3500|
|    Bob|        HR|  3500|
|Charlie|        HR|  3500|
+-------+----------+------+

Now, let’s apply several aggregate functions to the `Salary` column, grouped by department:


# Perform aggregation on multiple columns
result_df = df.groupBy("Department").agg(
    sum("Salary").alias("Total_Salary"),
    avg("Salary").alias("Average_Salary"),
    max("Salary").alias("Max_Salary"),
    min("Salary").alias("Min_Salary")
)

# Show the result
result_df.show()

+----------+------------+--------------+----------+----------+
|Department|Total_Salary|Average_Salary|Max_Salary|Min_Salary|
+----------+------------+--------------+----------+----------+
|     Sales|       12000|        4000.0|      5000|      3000|
|        HR|       10500|        3500.0|      3500|      3500|
+----------+------------+--------------+----------+----------+
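The `agg` call accepts as many aggregate expressions as you need. As a quick variation (reusing the `df` from above; the `count` import and the `dept_summary_df` name are just for illustration), you could also count the rows in each department:


from pyspark.sql.functions import count

# Add a per-department row count alongside the salary total
dept_summary_df = df.groupBy("Department").agg(
    sum("Salary").alias("Total_Salary"),
    count("*").alias("Employee_Count")
)

dept_summary_df.show()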

Explanation

In the example above:

  • We grouped the data by the “Department” column using the `groupBy` method.
  • We used the `agg` method to apply multiple aggregate functions to the “Salary” column.
  • The `sum`, `avg`, `max`, and `min` functions each build an aggregate expression over “Salary”; `agg` evaluates them and returns the results as a new DataFrame.
  • We used the `alias` method to rename the resulting columns for better readability.

This approach can be easily adapted to other aggregate functions and columns depending on your use case.
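A `groupBy` isn’t required, either. To aggregate over the entire DataFrame, pass the same expressions to `select` (a minimal sketch, again reusing `df` from above):


# Aggregate across the whole DataFrame (no grouping)
df.select(
    sum("Salary").alias("Company_Total_Salary"),
    avg("Salary").alias("Company_Average_Salary")
).show()

You can also pass a dictionary to `agg`, e.g. `df.groupBy("Department").agg({"Salary": "avg"})`, though that form supports only one function per column, since dictionary keys must be unique.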
