Renaming columns after performing aggregation in a PySpark DataFrame is a common operation. Once you have computed your aggregations, you can use the `.alias()` method to rename the columns. Below, I will illustrate this with a simple example.
Example: Renaming Columns After Aggregation in PySpark DataFrame
Let’s assume we have a DataFrame with some sales data, and we want to calculate the total sales and the average sales per category. After the aggregation, we will rename the resulting columns.
Step-by-Step Implementation
Step 1: Initialize Spark Session
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName("RenameColumnsAfterAggregation").getOrCreate()
Step 2: Create a Sample DataFrame
from pyspark.sql import Row
# Sample data
data = [Row(category='Electronics', sales=100),
Row(category='Electronics', sales=200),
Row(category='Furniture', sales=300),
Row(category='Furniture', sales=150),
Row(category='Groceries', sales=50)]
# Create DataFrame
df = spark.createDataFrame(data)
df.show()
+-----------+-----+
| category|sales|
+-----------+-----+
|Electronics| 100|
|Electronics| 200|
| Furniture| 300|
| Furniture| 150|
| Groceries| 50|
+-----------+-----+
Step 3: Perform Aggregation
We will calculate the total and average sales per category.
from pyspark.sql import functions as F
# Perform aggregation
agg_df = df.groupBy("category").agg(
F.sum("sales").alias("total_sales"),
F.avg("sales").alias("avg_sales")
)
agg_df.show()
+-----------+-----------+---------+
| category|total_sales|avg_sales|
+-----------+-----------+---------+
| Furniture| 450| 225.0|
|Electronics| 300| 150.0|
| Groceries| 50| 50.0|
+-----------+-----------+---------+
Step 4: Rename Columns Post Aggregation
Although we already renamed the columns during aggregation using `.alias()`, you can further rename the columns if needed using the `.withColumnRenamed()` method.
# Further renaming columns if needed
renamed_df = agg_df.withColumnRenamed("total_sales", "SumOfSales") \
.withColumnRenamed("avg_sales", "AverageSales")
renamed_df.show()
+-----------+----------+------------+
| category|SumOfSales|AverageSales|
+-----------+----------+------------+
| Furniture| 450| 225.0|
|Electronics| 300| 150.0|
| Groceries| 50| 50.0|
+-----------+----------+------------+
We have successfully renamed the columns after performing the aggregation operation on the DataFrame. You can also add more complex operations and renaming as per your requirements.