PySpark ISIN and IN Operator Tutorial

When working with PySpark, one often needs to filter data based on a specific condition or set of values. This is where the “ISIN” and “IN” operators come in handy. These operators allow us to select rows in a DataFrame where a column’s value matches any value in a specified list. This tutorial will provide an in-depth look at how to use the ISIN and IN operators in PySpark.

Understanding the ISIN Operator

The ISIN operator in PySpark is a method on the Column class that filters data based on a list of values. “ISIN” stands for “is in”: it returns a boolean column that is true wherever the column’s value is present in the list you provide. Rows where the condition is true are kept in the filtered result.

Basic Usage of ISIN Operator

Here’s how you can use the ISIN operator to filter a DataFrame:


from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize a SparkSession
spark = SparkSession.builder.appName("ISINOperatorExample").getOrCreate()

# Sample data
data = [("Alice", 1), ("Bob", 2), ("Catherine", 3), ("Denver", 4), ("Ella", 5)]

# Creating a DataFrame
df = spark.createDataFrame(data, ["Name", "ID"])

# List of values to filter on
values_to_filter = [1, 3]

# Using ISIN operator to filter the DataFrame
filtered_df = df.where(col("ID").isin(values_to_filter))

# Show the filtered DataFrame
filtered_df.show()

When you run this code, you should see the following output:


+---------+---+
|     Name| ID|
+---------+---+
|    Alice|  1|
|Catherine|  3|
+---------+---+

This output shows that only the rows with ID 1 and 3 are part of the resulting DataFrame, as those are the ones we specified in our values_to_filter list.
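
Tip: to keep the rows whose value is not in the list, negate the ISIN condition with the ~ operator. Here’s a minimal sketch reusing the DataFrame and list from above (excluded_df is a new name introduced for this example):


# Using ~ to negate ISIN: keep rows whose ID is NOT in the list
excluded_df = df.where(~col("ID").isin(values_to_filter))

# Shows Bob (2), Denver (4), and Ella (5)
excluded_df.show()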

Using ISIN with Multiple Columns

The ISIN operator also works on string columns, and you can combine several ISIN conditions to filter on more than one column at once. First, here’s an example that filters on a single string column; a multi-column sketch follows after the output:


# Additional column added
data = [("Alice", 1, "Engineering"), ("Bob", 2, "Sales"), ("Catherine", 3, "Engineering"), ("Denver", 4, "Marketing"), ("Ella", 5, "Sales")]

# Creating a new DataFrame with an additional 'Department' column
df = spark.createDataFrame(data, ["Name", "ID", "Department"])

# List of departments to filter on
departments_to_filter = ["Engineering", "Sales"]

# Using ISIN operator to filter the DataFrame on the Department column
filtered_df = df.where(col("Department").isin(departments_to_filter))

# Show the filtered DataFrame
filtered_df.show()

The output of this code snippet would be:


+---------+---+-----------+
|     Name| ID| Department|
+---------+---+-----------+
|    Alice|  1|Engineering|
|      Bob|  2|      Sales|
|Catherine|  3|Engineering|
|     Ella|  5|      Sales|
+---------+---+-----------+

In this example, the data is filtered based on the “Department” column, and we only get the rows where the department is either “Engineering” or “Sales”.
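
To filter on multiple columns at once, you can combine several ISIN conditions with the boolean column operators & (and) and | (or). Here’s a minimal sketch reusing the three-column DataFrame (multi_col_df is a new name introduced for this example):


# Combining ISIN conditions on two columns with & (logical AND)
multi_col_df = df.where(
    col("Department").isin(["Engineering", "Sales"]) & col("ID").isin([1, 2, 3])
)

# Keeps Alice (1, Engineering), Bob (2, Sales), and Catherine (3, Engineering)
multi_col_df.show()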

Understanding the IN Operator

While the ISIN operator is a column method in the PySpark DataFrame API, the IN operator is standard SQL syntax used in Spark SQL queries. The IN operator checks whether a column’s value matches any value in a list provided directly in the SQL query.

Using IN Operator in PySpark SQL Queries

To use the IN operator, you’ll need to register your DataFrame as a temporary view and then execute a SQL query against it. Here’s how:


# Register the DataFrame as a temporary view
df.createOrReplaceTempView("people")

# SQL query using IN operator (selecting Name and ID to match the first example)
query = "SELECT Name, ID FROM people WHERE ID IN (1, 3)"

# Execute the SQL query
filtered_df_sql = spark.sql(query)

# Show the DataFrame
filtered_df_sql.show()

The output is the same as the one obtained with the ISIN operator in the first example:


+---------+---+
|     Name| ID|
+---------+---+
|    Alice|  1|
|Catherine|  3|
+---------+---+

Note that in this example, we’ve hard-coded the values (1 and 3) directly into the SQL query, so we don’t need an external list of values as we did with ISIN.
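
If the values do live in a Python list, one common approach is to build the IN clause from that list before running the query. Here’s a minimal sketch (ids, in_clause, and dynamic_query are names introduced for this example); interpolating values into SQL is only safe for trusted input, so for anything user-supplied, prefer the DataFrame isin method:


# Building the IN clause from a Python list of trusted values
ids = [1, 3]
in_clause = ", ".join(str(v) for v in ids)

# Only safe because the values are trusted integers, not user input
dynamic_query = f"SELECT Name, ID FROM people WHERE ID IN ({in_clause})"
spark.sql(dynamic_query).show()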

Conclusion

In this tutorial, we’ve covered how to use the ISIN and IN operators in PySpark to filter data based on a list of values. The ISIN operator is useful when working with DataFrames, while the IN operator is used within PySpark SQL queries. Both can be used effectively to select rows based on value matching, making your data analysis tasks more efficient and straightforward.

Understanding how to leverage these operators can greatly enhance your data processing workflows in PySpark, particularly when dealing with large-scale data and the need for precise filtering. With the ability to compare columns against dynamic or static lists, you have the flexibility to create complex queries and transformations, all within the powerful framework that PySpark provides.
