Concatenating DataFrames is a common task in data processing pipelines. In PySpark, you can use the `union` method to concatenate DataFrames efficiently. Below is a detailed explanation along with a code snippet demonstrating the process.
Concatenating Two PySpark DataFrames
In PySpark, the `union` method allows you to concatenate DataFrames. For this method to work, both DataFrames must have the same number of columns with compatible types; note that `union` resolves columns by position, not by name. If the schemas don’t line up, you can rename or cast columns to match, or use `unionByName` to match columns by name instead, as sketched below.
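For instance, here is a minimal sketch of aligning mismatched schemas, assuming two hypothetical DataFrames `df_left` and `df_right` whose columns share names but not order:

from pyspark.sql.functions import col

# df_left and df_right are hypothetical DataFrames with the same
# column names (Name, Age) in different orders.
# union pairs columns by position, so df_left.union(df_right) would
# silently mix up the columns; unionByName pairs them by name:
df_aligned = df_left.unionByName(df_right)

# If one side is missing a column entirely (Spark 3.1+),
# allowMissingColumns fills the gap with nulls:
df_aligned = df_left.unionByName(df_right, allowMissingColumns=True)

# Alternatively, reorder/cast columns explicitly so plain union works:
df_right_fixed = df_right.select(col("Name"), col("Age").cast("int"))
df_concat = df_left.union(df_right_fixed)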
Step-by-Step Guide
1. **Create the DataFrames**: Start by creating two DataFrames with the same schema.
2. **Union the DataFrames**: Use the `union` method to concatenate them.
3. **Show the Result**: Display the concatenated DataFrame.
Example
Here is a code example using PySpark:
from pyspark.sql import SparkSession
# Create Spark session
spark = SparkSession.builder.appName("ConcatDataFrames").getOrCreate()
# Create DataFrame 1
data1 = [("John", 28), ("Anna", 23)]
columns1 = ["Name", "Age"]
df1 = spark.createDataFrame(data1, columns1)
# Create DataFrame 2
data2 = [("Mike", 35), ("Sara", 29)]
columns2 = ["Name", "Age"]
df2 = spark.createDataFrame(data2, columns2)
# Concatenate DataFrames using union
df_concatenated = df1.union(df2)
# Show the concatenated DataFrame
df_concatenated.show()
Output
+----+---+
|Name|Age|
+----+---+
|John| 28|
|Anna| 23|
|Mike| 35|
|Sara| 29|
+----+---+
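The same pattern scales past two DataFrames. Here is a short sketch, assuming `dfs` is a list of DataFrames that all share the schema from the example above, folding `union` across the list with `functools.reduce`:

from functools import reduce
from pyspark.sql import DataFrame

# dfs is assumed to hold any number of same-schema DataFrames
dfs = [df1, df2]
# Fold union pairwise across the whole list
df_all = reduce(DataFrame.union, dfs)
df_all.show()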
Important Considerations
1. **Schema Matching**: `union` resolves columns by position, so both DataFrames need the same number of columns with compatible types. When only the column names are guaranteed to line up, `unionByName` is the safer choice.
2. **Duplicates**: `union` follows SQL UNION ALL semantics and keeps duplicate rows; chain `.distinct()` afterwards if you need SQL UNION behavior (see the sketch after this list).
3. **Performance**: `union` is a narrow transformation that simply concatenates partitions without a shuffle, so it is cheap on its own; overall performance still depends on factors like partitioning and data size.
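To illustrate the duplicates point, a quick sketch reusing `df1` from the example above:

# union keeps duplicate rows (UNION ALL semantics)
df_dup = df1.union(df1)
print(df_dup.count())             # 4 -- both copies of each row survive
# Chain distinct() to emulate SQL UNION, which drops duplicates
print(df_dup.distinct().count())  # 2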
By following the above steps, you can efficiently concatenate two PySpark DataFrames.