How to Rename Multiple Columns Using withColumnRenamed in Apache Spark?

Renaming multiple columns in Apache Spark is most easily done by calling the `withColumnRenamed` method in a loop. `withColumnRenamed` returns a new DataFrame with the specified column renamed; it does not modify the original DataFrame, and it is a no-op if the column does not exist. By chaining multiple `withColumnRenamed` calls, you can rename any number of columns. The examples below show the same approach in PySpark, Scala, and Java.

Using PySpark

The snippet below loops over a list of (old, new) name pairs and applies `withColumnRenamed` once per pair.


from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder \
    .appName("Rename Columns Example") \
    .getOrCreate()

# Sample data
data = [("James", "Smith"), ("Anna", "Rose"), ("Robert", "Williams")]
columns = ["firstname", "lastname"]

# Create DataFrame
df = spark.createDataFrame(data, schema=columns)
df.show()

# Output before renaming
# +---------+--------+
# |firstname|lastname|
# +---------+--------+
# |    James|   Smith|
# |     Anna|    Rose|
# |   Robert|Williams|
# +---------+--------+

# List of columns to be renamed
new_column_names = [("firstname", "first_name"), ("lastname", "last_name")]

# Loop to rename columns
for old_name, new_name in new_column_names:
    df = df.withColumnRenamed(old_name, new_name)

df.show()

# Output after renaming
# +----------+---------+
# |first_name|last_name|
# +----------+---------+
# |     James|    Smith|
# |      Anna|     Rose|
# |    Robert| Williams|
# +----------+---------+

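When every column needs a new name, an alternative worth knowing is `toDF`, which takes the full list of new names positionally. A dictionary lookup with `dict.get(c, c)` keeps any unmapped columns unchanged. The sketch below shows only the name-list construction (plain Python, no Spark session needed); the column `"age"` is a hypothetical extra column added for illustration, and the final commented line shows how the list would be applied to a DataFrame.

```python
# Map of old -> new names; columns missing from the map keep their name.
mapping = {"firstname": "first_name", "lastname": "last_name"}

# In real code this would be df.columns; "age" stands in for an extra column.
old_columns = ["firstname", "lastname", "age"]

new_columns = [mapping.get(c, c) for c in old_columns]
print(new_columns)  # ['first_name', 'last_name', 'age']

# Applied to a DataFrame (names are positional, so order must match df.columns):
# df = df.toDF(*new_columns)
```

Because `toDF` assigns names by position, it replaces all column names in one call, which avoids growing a long chain of `withColumnRenamed` transformations when many columns change.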

Using Scala


import org.apache.spark.sql.SparkSession

// Initialize Spark session
val spark = SparkSession.builder
  .appName("Rename Columns Example")
  .getOrCreate()

// Sample data
val data = Seq(("James", "Smith"), ("Anna", "Rose"), ("Robert", "Williams"))
val columns = Seq("firstname", "lastname")

// Create DataFrame
val df = spark.createDataFrame(data).toDF(columns: _*)
df.show()

// Output before renaming
// +---------+--------+
// |firstname|lastname|
// +---------+--------+
// |    James|   Smith|
// |     Anna|    Rose|
// |   Robert|Williams|
// +---------+--------+

// List of columns to be renamed
val newColumnNames = Seq(("firstname", "first_name"), ("lastname", "last_name"))

// Loop to rename columns
val dfRenamed = newColumnNames.foldLeft(df) { (tempDF, rename) =>
  tempDF.withColumnRenamed(rename._1, rename._2)
}

dfRenamed.show()

// Output after renaming
// +----------+---------+
// |first_name|last_name|
// +----------+---------+
// |     James|    Smith|
// |      Anna|     Rose|
// |    Robert| Williams|
// +----------+---------+

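Scala's `foldLeft` has a direct analogue in Python's `functools.reduce`, which some PySpark codebases prefer over an explicit loop. The sketch below uses a minimal stand-in class (`FakeDF`, hypothetical) so it runs without a Spark session; with a real DataFrame, the stand-in would simply be `df`.

```python
from functools import reduce

class FakeDF:
    """Minimal stand-in for a DataFrame, just to demonstrate the reduce pattern."""
    def __init__(self, columns):
        self.columns = list(columns)

    def withColumnRenamed(self, old, new):
        # Like Spark, return a new object; unknown column names are left untouched.
        return FakeDF([new if c == old else c for c in self.columns])

pairs = [("firstname", "first_name"), ("lastname", "last_name")]
df = FakeDF(["firstname", "lastname"])

# Each step feeds the renamed DataFrame into the next rename, like foldLeft.
renamed = reduce(lambda d, p: d.withColumnRenamed(*p), pairs, df)
print(renamed.columns)  # ['first_name', 'last_name']
```

The `reduce` form and the plain `for` loop from the PySpark example are equivalent; the loop is usually easier to read, while `reduce` makes the fold structure explicit.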

Using Java


import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import java.util.Arrays;
import java.util.List;

// Initialize Spark session
SparkSession spark = SparkSession.builder()
        .appName("Rename Columns Example")
        .getOrCreate();

// Sample data
List<Row> data = Arrays.asList(
        RowFactory.create("James", "Smith"),
        RowFactory.create("Anna", "Rose"),
        RowFactory.create("Robert", "Williams")
);
StructType schema = new StructType(new StructField[]{
        new StructField("firstname", DataTypes.StringType, false, Metadata.empty()),
        new StructField("lastname", DataTypes.StringType, false, Metadata.empty())
});

// Create DataFrame
Dataset<Row> df = spark.createDataFrame(data, schema);
df.show();

// Output before renaming
// +---------+--------+
// |firstname|lastname|
// +---------+--------+
// |    James|   Smith|
// |     Anna|    Rose|
// |   Robert|Williams|
// +---------+--------+

// List of columns to be renamed
String[][] newColumnNames = {{"firstname", "first_name"}, {"lastname", "last_name"}};

// Loop to rename columns
for (String[] rename : newColumnNames) {
    df = df.withColumnRenamed(rename[0], rename[1]);
}

df.show();

// Output after renaming
// +----------+---------+
// |first_name|last_name|
// +----------+---------+
// |     James|    Smith|
// |      Anna|     Rose|
// |    Robert| Williams|
// +----------+---------+


In all three languages the approach is the same: iterate over a list of (old name, new name) pairs and call `withColumnRenamed` for each pair. Because each call returns a new DataFrame, the loop simply rebinds the variable to the latest result.
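One pitfall to keep in mind: `withColumnRenamed` is a no-op when the old column name does not exist, so a typo in the rename list fails silently. A small helper like the following (plain Python; the function name is hypothetical) can validate the rename list against `df.columns` before looping:

```python
def validated_renames(columns, renames):
    """Raise if any old name is absent, then return the new column list.

    withColumnRenamed itself silently does nothing for unknown columns,
    so checking up front surfaces typos early.
    """
    missing = [old for old, _ in renames if old not in columns]
    if missing:
        raise ValueError(f"Columns not found: {missing}")
    mapping = dict(renames)
    return [mapping.get(c, c) for c in columns]

print(validated_renames(["firstname", "lastname"],
                        [("firstname", "first_name"), ("lastname", "last_name")]))
# ['first_name', 'last_name']
```

In real code you would call something like `validated_renames(df.columns, new_column_names)` before entering the rename loop, so a misspelled old name raises immediately instead of being silently skipped.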
