NOT IN/ISIN Operators in PySpark: Usage Explained

In data processing and analysis, filtering data is a staple operation, and PySpark, the Python API for Apache Spark, provides robust functionality for these tasks. A common requirement is excluding rows whose values appear in a given list, the equivalent of SQL’s “NOT IN” condition. PySpark has no dedicated NOT IN operator; instead, the same result is achieved with the `isin` method combined with negation. This article explores this NOT IN/ISIN pattern in PySpark, demonstrating its use in various scenarios with detailed examples.

Understanding PySpark DataFrames

Before diving into the specifics of the NOT IN/ISIN operators, it is important to understand the basic structure in which PySpark operates. PySpark works with DataFrames, which are distributed collections of data organized into named columns. These resemble tables in a relational database but with richer optimizations under the hood.

To work with PySpark DataFrames, you will typically perform a series of transformations and actions. Transformations, such as filtering out certain rows, describe how to derive a new DataFrame and are evaluated lazily; actions, such as `show()` or `count()`, trigger the actual computation and return results.
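As a minimal sketch of that distinction (the app name and sample data here are arbitrary), `where` below is a transformation that only builds a query plan, and nothing is computed until the `show` action runs:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("lazy_eval_example").getOrCreate()

df = spark.createDataFrame([("Alice", 1), ("Bob", 2)], ["Name", "ID"])

filtered = df.where(col("ID") > 1)  # transformation: nothing is executed yet
filtered.show()                     # action: runs the plan and prints the rows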

Filtering with NOT IN/ISIN Operators

Filtering is a transformation that lets you specify conditions for including or excluding rows. In PySpark, the `Column.isin(*cols)` method returns a boolean `Column` that is true when the column’s value appears in a specified list; passing it to `where` (or its alias `filter`) keeps only the matching rows. To achieve the inverse, excluding rows whose value appears in the list, you negate the `isin` condition.

Using the `isin` Method

The `isin` method is straightforward to use. Here’s a simple example:


from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize a SparkSession
spark = SparkSession.builder.appName("example_not_in").getOrCreate()

# Create a PySpark DataFrame
data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)]
columns = ["Name", "ID"]
df = spark.createDataFrame(data, columns)

# Using `isin` to include rows where `Name` is either 'Alice' or 'Bob'
names_to_include = ['Alice', 'Bob']
df_filtered = df.where(col("Name").isin(names_to_include))
df_filtered.show()

Output:


+-----+---+
| Name| ID|
+-----+---+
|Alice|  1|
|  Bob|  2|
+-----+---+

In this example, the DataFrame `df_filtered` includes only the rows where the `Name` column is either ‘Alice’ or ‘Bob’.

Negating the `isin` Method for NOT IN

To simulate the NOT IN operation, we combine the `isin` method with a negation. Here’s how:


# Using `isin` with negation to exclude rows where `Name` is 'Alice' or 'Bob'
names_to_exclude = ['Alice', 'Bob']
df_filtered_not_in = df.where(~col("Name").isin(names_to_exclude))
df_filtered_not_in.show()

Output:


+-------+---+
|   Name| ID|
+-------+---+
|Charlie|  3|
+-------+---+

In this code snippet, `df_filtered_not_in` excludes any rows where the `Name` column contains ‘Alice’ or ‘Bob’. The tilde (`~`) is Python’s inversion operator, which PySpark’s `Column` class overloads as logical NOT; here it negates the boolean condition produced by `isin`.
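If you prefer SQL-style syntax, the same exclusion can also be written as an expression string with `pyspark.sql.functions.expr`; the sketch below assumes the `df` defined earlier and is equivalent to the negated `isin` call:

from pyspark.sql.functions import expr

# NOT IN expressed as a Spark SQL predicate string
df_filtered_sql = df.where(expr("Name NOT IN ('Alice', 'Bob')"))
df_filtered_sql.show()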

Handling NULL Values with NOT IN

When working with the NOT IN pattern, be careful with NULL values. In SQL, if any value in the list is NULL, the NOT IN condition never evaluates to true, so no rows are returned. PySpark’s `isin` follows the same three-valued logic: when the column value is NULL, `col.isin(...)` evaluates to NULL rather than true or false, its negation is also NULL, and the filter drops the row. It is therefore good practice to handle NULLs explicitly whenever they can appear in your dataset.

Example of Filtering with NULL Values

Let’s consider an example where we have NULL values, and we want to exclude certain names:


# Create a PySpark DataFrame with a NULL value
data_with_nulls = [("Alice", 1), ("Bob", 2), ("Charlie", 3), (None, 4)]
df_with_nulls = spark.createDataFrame(data_with_nulls, columns)

# Exclude 'Alice' and 'Bob' while explicitly keeping rows where Name is NULL
names_to_exclude = ['Alice', 'Bob']
df_filtered_with_nulls = df_with_nulls.where(
    (col("Name").isNull()) | (~col("Name").isin(names_to_exclude))
)
df_filtered_with_nulls.show()

Output:


+-------+---+
|   Name| ID|
+-------+---+
|Charlie|  3|
|   null|  4|
+-------+---+

In the output above, the row with the NULL name is retained because we explicitly allowed it with `col("Name").isNull()`. Without that check, `~col("Name").isin(...)` evaluates to NULL for that row and the filter would drop it along with ‘Alice’ and ‘Bob’.
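To see this three-valued logic in action, here is the same exclusion without the NULL check, reusing `df_with_nulls` from above; the row with the NULL name is dropped along with ‘Alice’ and ‘Bob’, leaving only ‘Charlie’:

# Without an explicit NULL check, ~isin evaluates to NULL for the NULL row,
# so the filter drops it as well
df_with_nulls.where(~col("Name").isin(["Alice", "Bob"])).show()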

Advanced Filtering with Multiple Columns

Sometimes, you may want to apply the NOT IN/ISIN operators across multiple columns. PySpark allows you to build complex conditions to achieve this.

Filtering Across Multiple Columns

The following example shows how you can filter out rows based on a combination of multiple column conditions:


# Create a more complex DataFrame
data_complex = [("Alice", 1, "Engineering"), ("Bob", 2, "Marketing"), ("Charlie", 3, "Engineering"), ("David", 4, "HR")]
columns_complex = ["Name", "ID", "Department"]
df_complex = spark.createDataFrame(data_complex, columns_complex)

# Define a condition for exclusion
condition = (
    (~col("Name").isin(["Alice", "Charlie"])) |
    (~col("Department").isin(["Engineering"]))
)

df_complex_filtered = df_complex.where(condition)
df_complex_filtered.show()

Output:


+-------+---+-----------+
|   Name| ID| Department|
+-------+---+-----------+
|    Bob|  2|  Marketing|
|  David|  4|         HR|
+-------+---+-----------+

Here, a row is excluded only when both conditions hold, that is, when the `Name` is ‘Alice’ or ‘Charlie’ and the `Department` is ‘Engineering’. As a result, only ‘Bob’ from ‘Marketing’ and ‘David’ from ‘HR’ remain.
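Equivalently, by De Morgan’s laws the same filter can be written by negating a single combined condition; a brief sketch using the `df_complex` DataFrame from above:

# A row is excluded only when both conditions hold, so negating the combined
# condition produces the same result as OR-ing the individual negations
combined = col("Name").isin(["Alice", "Charlie"]) & col("Department").isin(["Engineering"])
df_complex.where(~combined).show()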

Conclusion

PySpark offers versatile ways to filter out specific rows using the NOT IN/ISIN operators. By understanding and correctly implementing these techniques, you can effectively manipulate and analyze large datasets. The functionality provided by PySpark for filtering is both powerful and crucial for the preprocessing of data in various analytical workflows. As you integrate these capabilities into your Spark applications, remember to consider edge cases such as NULL values and evaluate filtering conditions carefully to avoid unexpected outcomes.

While these examples use simple datasets for clarity, the same principles apply when working with big data on distributed systems. PySpark’s ability to scale and process large volumes of data makes it a vital tool in the data engineer’s and data scientist’s toolkit. Learning and mastering the use of NOT IN/ISIN operations with PySpark can greatly enhance the quality and efficiency of your data processing tasks.
