To use aggregate functions on multiple columns in Spark SQL, you can leverage the `select` method in DataFrames along with various built-in aggregate functions like `count`, `sum`, `avg`, `min`, and `max`. You can use these functions to perform aggregations on multiple columns simultaneously. Below, I’ll provide an example using PySpark, a popular API for Apache Spark in Python.
Using Aggregate Functions on Multiple Columns in PySpark
First, you need to create a Spark session and a sample DataFrame:
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, avg, max, min
# Create Spark session
spark = SparkSession.builder.appName("AggregateFunctionsExample").getOrCreate()
# Create a sample DataFrame
data = [
("Alice", "Sales", 3000),
("Bob", "Sales", 4000),
("Charlie", "Sales", 5000),
("Alice", "HR", 3500),
("Bob", "HR", 3500),
("Charlie", "HR", 3500)
]
schema = ["Name", "Department", "Salary"]
df = spark.createDataFrame(data, schema)
# Show the DataFrame
df.show()
+-------+----------+------+
| Name|Department|Salary|
+-------+----------+------+
| Alice| Sales| 3000|
| Bob| Sales| 4000|
|Charlie| Sales| 5000|
| Alice| HR| 3500|
| Bob| HR| 3500|
|Charlie| HR| 3500|
+-------+----------+------+
Now, let’s perform aggregate functions on the columns:
# Perform aggregation on multiple columns
result_df = df.groupBy("Department").agg(
sum("Salary").alias("Total_Salary"),
avg("Salary").alias("Average_Salary"),
max("Salary").alias("Max_Salary"),
min("Salary").alias("Min_Salary")
)
# Show the result
result_df.show()
+----------+-----------+--------------+----------+----------+
|Department|Total_Salary|Average_Salary|Max_Salary|Min_Salary|
+----------+-----------+--------------+----------+----------+
| Sales| 12000| 4000.0| 5000| 3000|
| HR| 10500| 3500.0| 3500| 3500|
+----------+-----------+--------------+----------+----------+
Explanation
In the example above:
- We grouped the data by the “Department” column using the `groupBy` method.
- We used the `agg` method to apply multiple aggregate functions to the “Salary” column.
- The `sum`, `avg`, `max`, and `min` functions calculate the respective aggregate values and return the result as a new DataFrame.
- We used the `alias` method to rename the resulting columns for better readability.
This approach can be easily adapted to other aggregate functions and columns depending on your use case.