How to Join on Multiple Columns in PySpark: A Step-by-Step Guide

Joining on multiple columns is a common operation when working with DataFrames in PySpark. Whether you need to match rows on several join conditions or on columns that share the same name in both DataFrames, PySpark provides straightforward ways to do it. Here’s a step-by-step guide to joining on multiple columns in PySpark:

Step-by-Step Guide

Step 1: Initialize SparkSession

To work with PySpark, you first need to initialize a SparkSession. This is the entry point for DataFrame and SQL functionality.


from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Join on Multiple Columns Example") \
    .getOrCreate()

Step 2: Create Sample DataFrames

We will create two sample DataFrames with some common columns to demonstrate multi-column joins.


from pyspark.sql import Row

# Create DataFrame 1
data1 = [Row(id=1, name="Alice", dept="HR"),
         Row(id=2, name="Bob", dept="Engineering"),
         Row(id=3, name="Cathy", dept="Finance")]

df1 = spark.createDataFrame(data1)

# Create DataFrame 2
data2 = [Row(id=1, name="Alice", salary=5000),
         Row(id=2, name="Bob", salary=6000),
         Row(id=3, name="Cathy", salary=7000),
         Row(id=4, name="David", salary=8000)]

df2 = spark.createDataFrame(data2)

df1.show()
df2.show()

+---+-----+-----------+
| id| name|       dept|
+---+-----+-----------+
|  1|Alice|         HR|
|  2|  Bob|Engineering|
|  3|Cathy|    Finance|
+---+-----+-----------+

+---+-----+------+
| id| name|salary|
+---+-----+------+
|  1|Alice|  5000|
|  2|  Bob|  6000|
|  3|Cathy|  7000|
|  4|David|  8000|
+---+-----+------+

Step 3: Perform Join on Multiple Columns

Here we will join `df1` and `df2` on both the `id` and `name` columns.


# Perform an inner join on both the id and name columns
joined_df = df1.join(df2, (df1.id == df2.id) & (df1.name == df2.name), "inner")

# Show the results
joined_df.show()

+---+-----+-----------+---+-----+------+
| id| name|       dept| id| name|salary|
+---+-----+-----------+---+-----+------+
|  1|Alice|         HR|  1|Alice|  5000|
|  2|  Bob|Engineering|  2|  Bob|  6000|
|  3|Cathy|    Finance|  3|Cathy|  7000|
+---+-----+-----------+---+-----+------+
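
If the join keys have identical names in both DataFrames, you can also pass a list of column names instead of a boolean condition. PySpark then performs an equi-join on those columns and keeps a single copy of each join key in the result, which avoids the duplicated `id` and `name` columns shown above. A minimal sketch using the same DataFrames:


# Equi-join on a list of shared column names; the join keys appear only once in the output
joined_on_names = df1.join(df2, ["id", "name"], "inner")
joined_on_names.show()

The result contains a single `id` and `name` column followed by `dept` and `salary`, with the same three matching rows.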

Step 4: Selecting Specific Columns After Join

Often, you’ll want to select only specific columns from the joined DataFrame.


# Select specific columns
selected_columns = joined_df.select(df1["id"], df1["name"], df1["dept"], df2["salary"])

# Show the results
selected_columns.show()

+---+-----+-----------+------+
| id| name|       dept|salary|
+---+-----+-----------+------+
|  1|Alice|         HR|  5000|
|  2|  Bob|Engineering|  6000|
|  3|Cathy|    Finance|  7000|
+---+-----+-----------+------+
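
When the two DataFrames share more column names than just the join keys, referring to columns after the join can raise ambiguity errors. One common way around this, sketched below, is to alias each DataFrame before the join and qualify every column through its alias (the alias names `a` and `b` are arbitrary choices for this example):


from pyspark.sql.functions import col

# Give each DataFrame an alias so columns can be referenced unambiguously
a = df1.alias("a")
b = df2.alias("b")

aliased_df = a.join(b, (col("a.id") == col("b.id")) & (col("a.name") == col("b.name")), "inner") \
    .select(col("a.id"), col("a.name"), col("a.dept"), col("b.salary"))

aliased_df.show()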

Step 5: Handle Join Types

You can choose different join types, such as inner, left, right, and full outer, depending on your requirements. The join type is passed as the third argument to `join()`.


# Left Join
left_join_df = df1.join(df2, (df1.id == df2.id) & (df1.name == df2.name), "left")
left_join_df.show()

# Right Join
right_join_df = df1.join(df2, (df1.id == df2.id) & (df1.name == df2.name), "right")
right_join_df.show()

# Full Outer Join
outer_join_df = df1.join(df2, (df1.id == df2.id) & (df1.name == df2.name), "outer")
outer_join_df.show()

# Left Join
+---+-----+-----------+----+-----+------+
| id| name|       dept|  id| name|salary|
+---+-----+-----------+----+-----+------+
|  1|Alice|         HR|   1|Alice|  5000|
|  2|  Bob|Engineering|   2|  Bob|  6000|
|  3|Cathy|    Finance|   3|Cathy|  7000|
+---+-----+-----------+----+-----+------+

# Right Join
+----+-----+-----------+---+-----+------+
|  id| name|       dept| id| name|salary|
+----+-----+-----------+---+-----+------+
|   1|Alice|         HR|  1|Alice|  5000|
|   2|  Bob|Engineering|  2|  Bob|  6000|
|   3|Cathy|    Finance|  3|Cathy|  7000|
|null| null|       null|  4|David|  8000|
+----+-----+-----------+---+-----+------+

# Full Outer Join
+----+-----+-----------+----+-----+------+
|  id| name|       dept|  id| name|salary|
+----+-----+-----------+----+-----+------+
|   1|Alice|         HR|   1|Alice|  5000|
|   2|  Bob|Engineering|   2|  Bob|  6000|
|   3|Cathy|    Finance|   3|Cathy|  7000|
|null| null|       null|   4|David|  8000|
+----+-----+-----------+----+-----+------+
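
In addition to the types above, PySpark supports `left_semi` and `left_anti` joins, which use the same multi-column condition but return only columns from the left DataFrame. As a small sketch with the same data, a left anti join from `df2` keeps the rows that have no matching `id`/`name` pair in `df1` (David, in this example):


# Left anti join: rows of df2 with no matching (id, name) in df1
unmatched_df = df2.join(df1, (df2.id == df1.id) & (df2.name == df1.name), "left_anti")
unmatched_df.show()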

And there you have it! Joining on multiple columns in PySpark is straightforward and flexible enough to cover a variety of use cases. Pay attention to the join type and to column selection so that the resulting DataFrame has the schema and data you expect.

