Filtering or including rows in a PySpark DataFrame using a list is a common operation. PySpark provides several ways to achieve this, but the most efficient method is to use the `isin()` function, which filters rows based on the values present in a list. Below, I will provide a detailed explanation, along with a code snippet in PySpark to demonstrate how this can be done.
Step-by-step guide to filter or include rows using a list
Let’s assume we have a DataFrame containing information about employees, and we want to filter the rows where the department is either ‘HR’ or ‘IT’.
1. Create a sample DataFrame
First, we will create a Spark session and a sample DataFrame for demonstration purposes.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Initialize Spark session
spark = SparkSession.builder.appName("FilterExample").getOrCreate()
# Define schema
schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True),
StructField("age", IntegerType(), True),
StructField("department", StringType(), True)
])
# Create DataFrame
data = [
(1, "Alice", 30, "HR"),
(2, "Bob", 35, "Engineering"),
(3, "Charlie", 25, "IT"),
(4, "David", 40, "Finance"),
(5, "Eve", 28, "HR")
]
df = spark.createDataFrame(data, schema)
df.show()
+---+-------+---+----------+
| id| name|age|department|
+---+-------+---+----------+
| 1| Alice| 30| HR|
| 2| Bob| 35|Engineering|
| 3|Charlie| 25| IT|
| 4| David| 40| Finance|
| 5| Eve| 28| HR|
+---+-------+---+----------+
2. Define the list of departments
We create a Python list with the departments ‘HR’ and ‘IT’ that we want to filter for.
departments = ['HR', 'IT']
3. Filter the DataFrame using the isin()
method
We use the isin()
method to create a Boolean mask that will filter the DataFrame rows where the ‘department’ column value is in the given list.
filtered_df = df.filter(df.department.isin(departments))
filtered_df.show()
+---+-------+---+----------+
| id| name|age|department|
+---+-------+---+----------+
| 1| Alice| 30| HR|
| 3|Charlie| 25| IT|
| 5| Eve| 28| HR|
+---+-------+---+----------+
Conclusion
By following these steps, you can easily filter or include rows in a PySpark DataFrame using a list. The isin()
method is an efficient way to perform this operation. This method can be used for various types of columns and lists as needed.