How to Perform Union on DataFrames with Different Columns in Spark?

Performing a union operation on DataFrames with different columns requires aligning their schemas first. In Spark, this can be achieved by adding the missing columns (filled with null values) to each DataFrame so that both have identical schemas, and then applying the union. Below are step-by-step explanations and code snippets in PySpark and Scala that illustrate the process.

PySpark Example

In PySpark, you can perform the union operation by following these steps:

  1. Identify the different columns between the DataFrames.
  2. Add missing columns (with null values) to each DataFrame.
  3. Reorder the columns to ensure the schemas match.
  4. Perform the union operation.

Step-by-step code in PySpark:


from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

# Initialize Spark session
spark = SparkSession.builder.appName("UnionDataFrames").getOrCreate()

# Create two DataFrames with different columns
df1 = spark.createDataFrame([(1, 'John'), (2, 'Jane')], ['id', 'name'])
df2 = spark.createDataFrame([(3, 'HR'), (4, 'Engineering')], ['dept_id', 'dept_name'])

# Show original DataFrames
df1.show()
df2.show()

+---+----+
| id|name|
+---+----+
|  1|John|
|  2|Jane|
+---+----+

+-------+-----------+
|dept_id|  dept_name|
+-------+-----------+
|      3|         HR|
|      4|Engineering|
+-------+-----------+

Add the missing columns to each DataFrame and align their schemas. Because union resolves columns by position rather than by name, the columns must also be selected in the same order:


# Add missing columns as nulls, cast to the matching types so both schemas are truly identical
df1_with_missing_columns = df1.withColumn("dept_id", lit(None).cast("long")).withColumn("dept_name", lit(None).cast("string"))
df2_with_missing_columns = df2.withColumn("id", lit(None).cast("long")).withColumn("name", lit(None).cast("string"))

# Reorder columns to match the schemas
columns = ["id", "name", "dept_id", "dept_name"]
df1_aligned = df1_with_missing_columns.select(columns)
df2_aligned = df2_with_missing_columns.select(columns)

# Perform the union operation
union_df = df1_aligned.union(df2_aligned)

# Show the result
union_df.show()

+----+----+-------+-----------+
|  id|name|dept_id|  dept_name|
+----+----+-------+-----------+
|   1|John|   null|       null|
|   2|Jane|   null|       null|
|null|null|      3|         HR|
|null|null|      4|Engineering|
+----+----+-------+-----------+
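
If you are using Spark 3.1 or later, this manual alignment is not strictly necessary: unionByName accepts an allowMissingColumns parameter that adds the missing columns and fills them with nulls for you (the same option exists in the Scala Dataset API). A minimal sketch using the same df1 and df2:


# Spark 3.1+: let unionByName add the missing columns and fill them with nulls
union_df_simple = df1.unionByName(df2, allowMissingColumns=True)
union_df_simple.show()

This produces the same result as the manual alignment above.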

Scala Example

In Scala, the process is similar:

Step-by-step code in Scala:


import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

// Initialize Spark session
val spark = SparkSession.builder.appName("UnionDataFrames").getOrCreate()
import spark.implicits._

// Create two DataFrames with different columns
val df1 = Seq((1, "John"), (2, "Jane")).toDF("id", "name")
val df2 = Seq((3, "HR"), (4, "Engineering")).toDF("dept_id", "dept_name")

// Show original DataFrames
df1.show()
df2.show()

+---+----+
| id|name|
+---+----+
|  1|John|
|  2|Jane|
+---+----+

+-------+-----------+
|dept_id|  dept_name|
+-------+-----------+
|      3|         HR|
|      4|Engineering|
+-------+-----------+

Add missing columns to each DataFrame and align their schemas:


// Add missing columns as nulls, cast to the matching types so both schemas are truly identical
val df1_with_missing_columns = df1.withColumn("dept_id", lit(null).cast("int")).withColumn("dept_name", lit(null).cast("string"))
val df2_with_missing_columns = df2.withColumn("id", lit(null).cast("int")).withColumn("name", lit(null).cast("string"))

// Reorder columns to match the schemas
val columns = Seq("id", "name", "dept_id", "dept_name")
val df1_aligned = df1_with_missing_columns.select(columns.head, columns.tail: _*)
val df2_aligned = df2_with_missing_columns.select(columns.head, columns.tail: _*)

// Perform the union operation (unionByName matches columns by name rather than by position)
val union_df = df1_aligned.unionByName(df2_aligned)

// Show the result
union_df.show()

+----+----+-------+-----------+
|  id|name|dept_id|  dept_name|
+----+----+-------+-----------+
|   1|John|   null|       null|
|   2|Jane|   null|       null|
|null|null|      3|         HR|
|null|null|      4|Engineering|
+----+----+-------+-----------+

By ensuring the schemas of the DataFrames are identical before performing the union, you can easily combine DataFrames with different columns using Spark.
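
When the column sets are not known ahead of time, the same approach can be wrapped in a small helper that discovers the missing columns dynamically. The sketch below is a PySpark illustration (the function name union_with_different_columns is just a placeholder), following the manual alignment shown earlier:


from pyspark.sql import DataFrame
from pyspark.sql.functions import lit

def union_with_different_columns(df_a: DataFrame, df_b: DataFrame) -> DataFrame:
    # Columns of df_a first, then any extra columns from df_b
    all_columns = df_a.columns + [c for c in df_b.columns if c not in df_a.columns]
    # Look up each column's type so the added nulls are cast consistently
    types = {f.name: f.dataType for f in df_a.schema.fields + df_b.schema.fields}

    def align(df: DataFrame) -> DataFrame:
        # Add each missing column as a typed null, then select in a common order
        for c in all_columns:
            if c not in df.columns:
                df = df.withColumn(c, lit(None).cast(types[c]))
        return df.select(all_columns)

    return align(df_a).union(align(df_b))

# Usage: union_with_different_columns(df1, df2).show()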
