How to Filter or Include Rows in PySpark DataFrame Using a List?

Filtering or including rows in a PySpark DataFrame based on a list of values is a common operation. PySpark provides several ways to achieve this, but the most common and idiomatic method is the `isin()` column method, which keeps rows whose column value appears in a given list. Below is a step-by-step explanation, along with a PySpark code snippet demonstrating how this is done.

Step-by-step guide to filter or include rows using a list

Let’s assume we have a DataFrame containing information about employees, and we want to filter the rows where the department is either ‘HR’ or ‘IT’.

1. Create a sample DataFrame

First, we will create a Spark session and a sample DataFrame for demonstration purposes.


from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Initialize Spark session
spark = SparkSession.builder.appName("FilterExample").getOrCreate()

# Define schema
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("department", StringType(), True)
])

# Create DataFrame
data = [
    (1, "Alice", 30, "HR"),
    (2, "Bob", 35, "Engineering"),
    (3, "Charlie", 25, "IT"),
    (4, "David", 40, "Finance"),
    (5, "Eve", 28, "HR")
]

df = spark.createDataFrame(data, schema)
df.show()

+---+-------+---+-----------+
| id|   name|age| department|
+---+-------+---+-----------+
|  1|  Alice| 30|         HR|
|  2|    Bob| 35|Engineering|
|  3|Charlie| 25|         IT|
|  4|  David| 40|    Finance|
|  5|    Eve| 28|         HR|
+---+-------+---+-----------+

2. Define the list of departments

We create a Python list with the departments ‘HR’ and ‘IT’ that we want to filter for.


departments = ['HR', 'IT']

3. Filter the DataFrame using the isin() method

We call isin() on the ‘department’ column to build a Boolean column expression, which filter() then uses to keep only the rows whose department value appears in the list. Note that, like all DataFrame transformations, this is evaluated lazily.


filtered_df = df.filter(df.department.isin(departments))
filtered_df.show()

+---+-------+---+----------+
| id|   name|age|department|
+---+-------+---+----------+
|  1|  Alice| 30|        HR|
|  3|Charlie| 25|        IT|
|  5|    Eve| 28|        HR|
+---+-------+---+----------+

Conclusion

By following these steps, you can easily include or exclude rows in a PySpark DataFrame using a list. The isin() method is a concise and efficient way to perform this operation, and it works for various column types as long as the list elements match the column's data type.
