How to Join Two DataFrames in Apache Spark: Select All Columns from One and Specific Columns from Another?

Joining two DataFrames is a common operation in Apache Spark. Often you need to select all columns from one DataFrame and only specific columns from the other. The steps and code snippets below show how to achieve this with PySpark.

Using PySpark to Join DataFrames

Let’s assume you have two DataFrames, `df1` and `df2`. `df1` contains columns `["id", "name", "age"]` and `df2` contains columns `["id", "address", "salary"]`. We want to join these DataFrames on the column `id`, selecting all columns from `df1` and only the `address` column from `df2`.

Sample DataFrames

First, let’s create sample data and corresponding DataFrames:

from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("JoinExample").getOrCreate()

# Sample data for DataFrame1
data1 = [
    (1, "Alice", 29),
    (2, "Bob", 31),
    (3, "Cathy", 28)
]
columns1 = ["id", "name", "age"]

# Sample data for DataFrame2
data2 = [
    (1, "1234 Park Ave", 60000),
    (2, "5678 Main St", 75000),
    (3, "9101 Broadway", 80000)
]
columns2 = ["id", "address", "salary"]

# Create DataFrames
df1 = spark.createDataFrame(data1, columns1)
df2 = spark.createDataFrame(data2, columns2)
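
As a quick optional check, you can print each DataFrame’s schema and contents before joining to confirm the sample data was created as expected:

# Optional sanity check: verify the sample DataFrames created above
df1.printSchema()
df2.printSchema()
df1.show()
df2.show()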

Performing the Join

To join the DataFrames and select the required columns, you can use the `join` method along with column selection:

from pyspark.sql.functions import col

# Perform the join
joined_df = df1.join(df2, on="id")

# Select all columns from df1 and the 'address' column from df2
result_df = joined_df.select(df1["*"], col("address"))

# Show the result
result_df.show()

+---+-----+---+-------------+
| id| name|age|      address|
+---+-----+---+-------------+
|  1|Alice| 29|1234 Park Ave|
|  2|  Bob| 31| 5678 Main St|
|  3|Cathy| 28|9101 Broadway|
+---+-----+---+-------------+

In the above code:

  • The DataFrames `df1` and `df2` are created from sample data.
  • The `join` method performs an inner join (the default) on the column `id`; other join types can be requested via the `how` argument.
  • The `select` method is used to select all columns from `df1` and the `address` column from `df2` (a qualified variant is sketched below).
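
Because `df1` and `df2` share only the join key `id` here, the unqualified `col("address")` is unambiguous. If the two DataFrames shared other column names as well, you could qualify the column by its source DataFrame instead. A minimal sketch of the same select, assuming the `joined_df` from above:

# Qualified reference: avoids ambiguity when column names overlap
result_df = joined_df.select(df1["*"], df2["address"])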

Conclusion

This method lets you join two DataFrames in PySpark, keeping all columns from one while picking specific columns from the other, and gives you full control over the resulting DataFrame’s schema and data.
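
If you prefer, the same result can be reached by projecting `df2` down to the needed columns before the join. This sketch assumes the same `df1` and `df2` as above:

# Alternative: trim df2 first, then join; passing "id" as a string
# to `on` keeps a single copy of the join key in the result
result_alt = df1.join(df2.select("id", "address"), on="id")
result_alt.show()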
