Joining two DataFrames is a common operation in Apache Spark. Often, you need to select all columns from one DataFrame and only specific columns from another. Below are the steps and code snippets showing how to achieve this with PySpark.
Using PySpark to Join DataFrames
Let’s assume you have two DataFrames, `df1` and `df2`. `df1` contains columns `["id", "name", "age"]` and `df2` contains columns `["id", "address", "salary"]`. We want to join these DataFrames on the column `id`, selecting all columns from `df1` and only the `address` column from `df2`.
Sample DataFrames
First, let’s create sample data and corresponding DataFrames:
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName("JoinExample").getOrCreate()
# Sample data for DataFrame1
data1 = [
(1, "Alice", 29),
(2, "Bob", 31),
(3, "Cathy", 28)
]
columns1 = ["id", "name", "age"]
# Sample data for DataFrame2
data2 = [
(1, "1234 Park Ave", 60000),
(2, "5678 Main St", 75000),
(3, "9101 Broadway", 80000)
]
columns2 = ["id", "address", "salary"]
# Create DataFrames
df1 = spark.createDataFrame(data1, columns1)
df2 = spark.createDataFrame(data2, columns2)
Performing the Join
To join the DataFrames and select the required columns, you can use the `join` method along with column selection:
from pyspark.sql.functions import col
# Perform an inner join (the default) on the shared "id" column
joined_df = df1.join(df2, on="id")
# Select all columns from df1 and the 'address' column from df2
result_df = joined_df.select(df1["*"], col("address"))
# Show the result
result_df.show()
+---+-----+---+-------------+
| id| name|age|      address|
+---+-----+---+-------------+
|  1|Alice| 29|1234 Park Ave|
|  2|  Bob| 31| 5678 Main St|
|  3|Cathy| 28|9101 Broadway|
+---+-----+---+-------------+
In the above code:
- The DataFrames `df1` and `df2` are created from sample data.
- The `join` method is used to perform an inner join on the column `id`.
- The `select` method is used to select all columns from `df1` and the `address` column from `df2`.
Conclusion
This approach lets you join two DataFrames in PySpark and keep every column from one DataFrame while picking specific columns from the other, giving you full control over the resulting DataFrame's schema and data.