How to Filter a DataFrame in PySpark Using Columns from Another DataFrame?

Filtering a DataFrame in PySpark using columns from another DataFrame is a common operation. It comes up whenever you have two DataFrames and want to keep only the rows in one whose values appear in the other. You can accomplish this with a regular join or with a broadcast join. Below, I’ll show you both approaches with detailed explanations and examples in Python using PySpark.

Approach 1: Using Join Operation

One way to filter is to join the two DataFrames on the column(s) of interest: an inner join keeps only the rows of the first DataFrame that have a match in the second.

Let’s assume we have two DataFrames: `df1` and `df2`.


from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder.master("local[*]").appName("FilterDataFrame").getOrCreate()

# Example data
data1 = [("Alice", 34), ("Bob", 45), ("Cathy", 29)]
data2 = [("Alice",), ("David",)]

# Creating DataFrames
df1 = spark.createDataFrame(data1, ["name", "age"])
df2 = spark.createDataFrame(data2, ["name"])

# Show the DataFrames
df1.show()
df2.show()

+-----+---+
| name|age|
+-----+---+
|Alice| 34|
|  Bob| 45|
|Cathy| 29|
+-----+---+

+-----+
| name|
+-----+
|Alice|
|David|
+-----+

Next, we perform the join operation to filter `df1` using `df2`.


# Filter df1 based on names present in df2
result_df = df1.join(df2, on="name", how="inner")

# Show the filtered DataFrame
result_df.show()

+-----+---+
| name|age|
+-----+---+
|Alice| 34|
+-----+---+

In this example, the join operation filters `df1` to include only rows where the `name` column matches a name in `df2`. The result contains only “Alice”.
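A closely related variant: if `df2` carried extra columns you did not want in the output, or contained duplicate names, a `left_semi` join keeps only `df1`’s columns and returns each matching row of `df1` at most once. Here is a minimal sketch using the same DataFrames:

# Semi join: same filter, but only df1's columns survive
# and each row of df1 appears at most once
semi_df = df1.join(df2, on="name", how="left_semi")
semi_df.show()

+-----+---+
| name|age|
+-----+---+
|Alice| 34|
+-----+---+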

Approach 2: Using a Broadcast Join

A broadcast join is another efficient way to filter when `df2` is small enough to fit in memory on every executor. The `broadcast()` hint tells Spark to ship a copy of `df2` to each executor, so `df1` never has to be shuffled across the network, which can make the join faster in many cases.


from pyspark.sql.functions import broadcast

# Mark df2 with a broadcast hint so Spark ships it to every executor
df2_broadcasted = broadcast(df2)

# Filter df1 using the broadcasted df2
result_df_broadcasted = df1.join(df2_broadcasted, on="name", how="inner")

# Show the filtered DataFrame
result_df_broadcasted.show()

+-----+---+
| name|age|
+-----+---+
|Alice| 34|
+-----+---+
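To verify that the hint took effect, you can inspect the physical plan with `explain()`; it should show a broadcast hash join rather than a sort-merge join (the exact plan text varies across Spark versions):

# Print the physical plan; look for "BroadcastHashJoin"
result_df_broadcasted.explain()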

Both approaches yield the same result. Use a plain join when both DataFrames are large, and a broadcast join when `df2` is small enough to fit in executor memory; broadcasting avoids shuffling `df1` across the cluster, which makes the filtering operation more efficient.
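The same pattern also gives you the inverse filter. A `left_anti` join keeps the rows of `df1` whose `name` does not appear in `df2` (row order in the output may vary):

# Anti join: keep rows of df1 with no match in df2
anti_df = df1.join(df2, on="name", how="left_anti")
anti_df.show()

+-----+---+
| name|age|
+-----+---+
|  Bob| 45|
|Cathy| 29|
+-----+---+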

Conclusion

Filtering a DataFrame in PySpark using columns from another DataFrame can be accomplished with a regular join or a broadcast join, depending on the size of the data. Choosing the right one keeps the filtering process both efficient and scalable.
