How to Filter or Include Rows in PySpark DataFrame Using a List?

Filtering or including rows in a PySpark DataFrame based on a list of values is a common operation. PySpark provides several ways to do this, and the most idiomatic is the `isin()` method, which tests whether a column's value appears in a list. Below is a step-by-step explanation, along with PySpark code snippets demonstrating how it works.

Step-by-step guide to filter or include rows using a list

Let’s assume we have a DataFrame containing information about employees, and we want to filter the rows where the department is either ‘HR’ or ‘IT’.

1. Create a sample DataFrame

First, we will create a Spark session and a sample DataFrame for demonstration purposes.


from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Initialize Spark session
spark = SparkSession.builder.appName("FilterExample").getOrCreate()

# Define schema
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("department", StringType(), True)
])

# Create DataFrame
data = [
    (1, "Alice", 30, "HR"),
    (2, "Bob", 35, "Engineering"),
    (3, "Charlie", 25, "IT"),
    (4, "David", 40, "Finance"),
    (5, "Eve", 28, "HR")
]

df = spark.createDataFrame(data, schema)
df.show()

+---+-------+---+-----------+
| id|   name|age| department|
+---+-------+---+-----------+
|  1|  Alice| 30|         HR|
|  2|    Bob| 35|Engineering|
|  3|Charlie| 25|         IT|
|  4|  David| 40|    Finance|
|  5|    Eve| 28|         HR|
+---+-------+---+-----------+

2. Define the list of departments

We create a Python list with the departments ‘HR’ and ‘IT’ that we want to filter for.


departments = ['HR', 'IT']

3. Filter the DataFrame using the `isin()` method

We call `isin()` on the 'department' column to build a Column condition that is true when the column's value appears in the list, then pass that condition to `filter()` to keep only the matching rows.


filtered_df = df.filter(df.department.isin(departments))
filtered_df.show()

+---+-------+---+----------+
| id|   name|age|department|
+---+-------+---+----------+
|  1|  Alice| 30|        HR|
|  3|Charlie| 25|        IT|
|  5|    Eve| 28|        HR|
+---+-------+---+----------+

Conclusion

By following these steps, you can easily filter or include rows in a PySpark DataFrame using a list. The `isin()` method is a concise and efficient way to perform this operation, and it works with columns and lists of any comparable type.
