How to Rename Columns After Aggregating in a PySpark DataFrame?

Renaming columns after performing an aggregation on a PySpark DataFrame is a common operation. Once you have computed your aggregations, you can name the resulting columns with the `.alias()` method inside `agg()`, or rename them afterwards with the `.withColumnRenamed()` method. Below, I will illustrate both approaches with a simple example.
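For quick reference, here is a minimal sketch of both approaches. It assumes a DataFrame `df` with a grouping column `category` and a numeric column `sales`, the same names used in the example that follows.


from pyspark.sql import functions as F

# Option 1: name the aggregated column at aggregation time with .alias()
totals = df.groupBy("category").agg(F.sum("sales").alias("total_sales"))

# Option 2: rename an existing column afterwards with .withColumnRenamed()
totals = totals.withColumnRenamed("total_sales", "SumOfSales")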

Example: Renaming Columns After Aggregation in a PySpark DataFrame

Let’s assume we have a DataFrame with some sales data, and we want to calculate the total sales and the average sales per category. After the aggregation, we will rename the resulting columns.

Step-by-Step Implementation

Step 1: Initialize Spark Session


from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("RenameColumnsAfterAggregation").getOrCreate()

Step 2: Create a Sample DataFrame


from pyspark.sql import Row

# Sample data
data = [Row(category='Electronics', sales=100),
        Row(category='Electronics', sales=200),
        Row(category='Furniture', sales=300),
        Row(category='Furniture', sales=150),
        Row(category='Groceries', sales=50)]

# Create DataFrame
df = spark.createDataFrame(data)
df.show()

+-----------+-----+
|   category|sales|
+-----------+-----+
|Electronics|  100|
|Electronics|  200|
|  Furniture|  300|
|  Furniture|  150|
|  Groceries|   50|
+-----------+-----+

Step 3: Perform Aggregation

We will calculate the total and average sales per category.


from pyspark.sql import functions as F

# Perform aggregation
agg_df = df.groupBy("category").agg(
    F.sum("sales").alias("total_sales"),
    F.avg("sales").alias("avg_sales")
)
agg_df.show()

+-----------+-----------+---------+
|   category|total_sales|avg_sales|
+-----------+-----------+---------+
|  Furniture|        450|    225.0|
|Electronics|        300|    150.0|
|  Groceries|         50|     50.0|
+-----------+-----------+---------+
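For comparison, here is a quick sketch (reusing the same `df` and `F` as above) of what happens when the `.alias()` calls are omitted: Spark falls back to generated column names such as `sum(sales)` and `avg(sales)`, which is exactly why renaming is usually needed.


# Without .alias(), the aggregated columns receive generated names
default_df = df.groupBy("category").agg(F.sum("sales"), F.avg("sales"))
default_df.printSchema()
# root
#  |-- category: string (nullable = true)
#  |-- sum(sales): long (nullable = true)
#  |-- avg(sales): double (nullable = true)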

Step 4: Rename Columns After Aggregation

Although we already named the columns during aggregation using `.alias()`, you can rename them again later if needed using the `.withColumnRenamed()` method.


# Further renaming columns if needed
renamed_df = agg_df.withColumnRenamed("total_sales", "SumOfSales") \
                   .withColumnRenamed("avg_sales", "AverageSales")
renamed_df.show()

+-----------+----------+------------+
|   category|SumOfSales|AverageSales|
+-----------+----------+------------+
|  Furniture|       450|       225.0|
|Electronics|       300|       150.0|
|  Groceries|        50|        50.0|
+-----------+----------+------------+

We have successfully renamed the columns after performing the aggregation on the DataFrame. You can extend this pattern with more complex aggregations and further renaming as your requirements demand.
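For example, if several aggregated columns need new names at once, one pattern (a sketch, reusing the `agg_df` from Step 3; the `rename_map` dictionary is just an illustration) is to loop over a mapping with `.withColumnRenamed()`, or to pass the complete list of column names to `.toDF()`.


# Rename several columns at once by looping over a mapping
rename_map = {"total_sales": "SumOfSales", "avg_sales": "AverageSales"}

bulk_renamed_df = agg_df
for old_name, new_name in rename_map.items():
    bulk_renamed_df = bulk_renamed_df.withColumnRenamed(old_name, new_name)

# Alternative: toDF() replaces all column names positionally
bulk_renamed_df = agg_df.toDF("category", "SumOfSales", "AverageSales")

bulk_renamed_df.show()

Both produce the same result as the chained `.withColumnRenamed()` calls above; the loop keeps the old-to-new mapping explicit, while `.toDF()` is positional and requires listing every column.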

