How to Filter a PySpark DataFrame Using SQL-like IN Clause?

Filtering a DataFrame using an SQL-like IN clause is a common requirement when working with PySpark. You can achieve this in multiple ways, such as using the `filter()` or `where()` methods, leveraging the DataFrame DSL, or employing a SQL query. Below, I will provide a comprehensive explanation along with examples to illustrate these approaches.

Approach 1: Using filter() Method

The `filter()` method in PySpark allows filtering rows based on a condition. You can use the `isin()` function to apply an SQL-like IN clause.

Here’s an example:


from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder.appName("Filter IN Clause").getOrCreate()

# Create a sample DataFrame
data = [(1, "Alice"), (2, "Bob"), (3, "Catherine"), (4, "David")]
df = spark.createDataFrame(data, ["id", "name"])

# Filter using IS IN clause
filtered_df = df.filter(df.name.isin("Alice", "David"))
filtered_df.show()

+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  4|David|
+---+-----+

Approach 2: Using where() Method

The `where()` method serves the same purpose as `filter()`, and you can also use the `isin()` function with this method.

Here’s an example:


# Filter using WHERE clause
filtered_df = df.where(df.name.isin("Alice", "David"))
filtered_df.show()

+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  4|David|
+---+-----+

Approach 3: Using SQL Query

You can register the DataFrame as a temporary table and then execute an SQL query that includes the IN clause.

Here’s an example:


# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

# Use spark.sql to execute an SQL query
filtered_df = spark.sql("SELECT * FROM people WHERE name IN ('Alice', 'David')")
filtered_df.show()

+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  4|David|
+---+-----+

Comparison

Here is a simple comparison of the three approaches:

Approach Method Ease of Use
Approach 1 filter() Good for direct method chaining
Approach 2 where() Interchangeable with filter()
Approach 3 SQL Query Good for complex SQL operations and readability

All these methods are equally valid and can be used based on your preference or the specific requirements of your application. The `filter()` and `where()` methods are more suited for programmatic chaining, whereas the SQL approach can be more intuitive for those familiar with SQL syntax.

About Editorial Team

Our Editorial Team is made up of tech enthusiasts deeply skilled in Apache Spark, PySpark, and Machine Learning, alongside proficiency in Pandas, R, Hive, PostgreSQL, Snowflake, and Databricks. They're not just experts; they're passionate educators, dedicated to demystifying complex data concepts through engaging and easy-to-understand tutorials.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top