What are the Different Types of Joins in Apache Spark?

Apache Spark provides several types of joins to combine data from multiple DataFrames or RDDs. Understanding these join types and knowing when to use them is crucial for efficient data processing. Let’s discuss the main types of joins offered by Apache Spark.

Types of Joins in Apache Spark

Here are the primary types of joins available in Apache Spark:

1. Inner Join

An inner join returns only the rows that have matching values in both DataFrames or RDDs.

Example in PySpark


from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("Joins").getOrCreate()

# Create sample DataFrames
df1 = spark.createDataFrame([
    (1, "Alice"),
    (2, "Bob"),
    (3, "Charlie")
], ["id", "name"])

df2 = spark.createDataFrame([
    (1, "Math"),
    (2, "Physics"),
    (4, "Biology")
], ["id", "subject"])

# Perform Inner Join
inner_join_df = df1.join(df2, df1.id == df2.id, "inner")
inner_join_df.show()

+---+-----+---+-------+
| id| name| id|subject|
+---+-----+---+-------+
|  1|Alice|  1|   Math|
|  2|  Bob|  2|Physics|
+---+-----+---+-------+
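Notice that joining on the expression df1.id == df2.id keeps both id columns in the result, as shown above. When the join key has the same name in both DataFrames, you can pass the column name itself and Spark keeps a single id column. A minimal sketch using the same df1 and df2:

# Join on the shared column name; the result carries one "id" column
# followed by "name" and "subject"
inner_join_dedup_df = df1.join(df2, "id", "inner")
inner_join_dedup_df.show()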

2. Left Outer Join

A left outer join returns all the rows from the left DataFrame and the matched rows from the right DataFrame. Rows in the left DataFrame that do not have a match in the right DataFrame will have null in the corresponding columns of the right DataFrame.

Example in PySpark


# Perform Left Outer Join
left_outer_join_df = df1.join(df2, df1.id == df2.id, "left_outer")
left_outer_join_df.show()

+---+-------+----+-------+
| id|   name|  id|subject|
+---+-------+----+-------+
|  1|  Alice|   1|   Math|
|  2|    Bob|   2|Physics|
|  3|Charlie|null|   null|
+---+-------+----+-------+
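Unmatched rows surface as nulls, which you often want to replace downstream. One common pattern is DataFrame.fillna; a small sketch, assuming the placeholder value "Unknown" fits your use case:

# Replace nulls introduced by the join with a placeholder value
filled_df = left_outer_join_df.fillna({"subject": "Unknown"})
filled_df.show()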

3. Right Outer Join

A right outer join returns all the rows from the right DataFrame and the matched rows from the left DataFrame. Rows in the right DataFrame that do not have a match in the left DataFrame will have null in the corresponding columns of the left DataFrame.

Example in PySpark


# Perform Right Outer Join
right_outer_join_df = df1.join(df2, df1.id == df2.id, "right_outer")
right_outer_join_df.show()

+----+-------+---+-------+
|  id|   name| id|subject|
+----+-------+---+-------+
|   1|  Alice|  1|   Math|
|   2|    Bob|  2|Physics|
|null|   null|  4|Biology|
+----+-------+---+-------+
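A right outer join is the mirror image of a left outer join, so the two calls below return the same matched and unmatched rows, differing only in column order (a sketch using the same df1 and df2):

# Equivalent results; only the column order differs
right_df = df1.join(df2, df1.id == df2.id, "right_outer")
mirrored_df = df2.join(df1, df2.id == df1.id, "left_outer")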

4. Full Outer Join

A full outer join returns all the rows when there is a match in either the left or right DataFrame. Rows from either DataFrame that do not have a match will have null in the corresponding columns of the other DataFrame.

Example in PySpark


# Perform Full Outer Join
full_outer_join_df = df1.join(df2, df1.id == df2.id, "outer")
full_outer_join_df.show()

+----+-------+----+-------+
|  id|   name|  id|subject|
+----+-------+----+-------+
|   1|  Alice|   1|   Math|
|   2|    Bob|   2|Physics|
|   3|Charlie|null|   null|
|null|   null|   4|Biology|
+----+-------+----+-------+
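Spark accepts several aliases for the full outer join type string, so the following calls behave identically:

# "outer", "full", and "full_outer" are interchangeable join-type strings
full_a = df1.join(df2, df1.id == df2.id, "outer")
full_b = df1.join(df2, df1.id == df2.id, "full")
full_c = df1.join(df2, df1.id == df2.id, "full_outer")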

5. Left Semi Join

A left semi join returns all the rows from the left DataFrame where there is a match in the right DataFrame. It only returns columns from the left DataFrame.

Example in PySpark


# Perform Left Semi Join
left_semi_join_df = df1.join(df2, df1.id == df2.id, "left_semi")
left_semi_join_df.show()

+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
+---+-----+
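If you prefer SQL, the same result can be expressed with LEFT SEMI JOIN after registering temporary views (a sketch; the view names t1 and t2 are arbitrary):

# Register temp views so the DataFrames are visible to Spark SQL
df1.createOrReplaceTempView("t1")
df2.createOrReplaceTempView("t2")

semi_sql_df = spark.sql("""
    SELECT * FROM t1
    LEFT SEMI JOIN t2 ON t1.id = t2.id
""")
semi_sql_df.show()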

6. Left Anti Join

A left anti join returns only the rows from the left DataFrame that do not have a match in the right DataFrame. It only returns columns from the left DataFrame.

Example in PySpark


# Perform Left Anti Join
left_anti_join_df = df1.join(df2, df1.id == df2.id, "left_anti")
left_anti_join_df.show()

+---+-------+
| id|   name|
+---+-------+
|  3|Charlie|
+---+-------+

These are the primary types of joins available in Apache Spark. Each join type serves a different use case, and understanding them helps you structure and optimize data operations in your Spark applications.
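A closing note on optimization: when one side of a join is small enough to fit in executor memory, you can hint Spark to broadcast it and avoid a shuffle. A minimal sketch, assuming df2 is the small side:

from pyspark.sql.functions import broadcast

# Hint that df2 is small; Spark ships a copy to every executor
# instead of shuffling both sides across the cluster
broadcast_join_df = df1.join(broadcast(df2), "id", "inner")
broadcast_join_df.show()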
