How to Filter a Spark DataFrame by Checking Values in a List with Additional Criteria?

Filtering a Spark DataFrame based on a list of values with additional criteria is a common operation in data processing. We can achieve this with the `filter` or `where` method, combining an `isin` check against the list with other logical conditions.

Let’s break down the process using PySpark as an example.

Example using PySpark

Scenario

Let’s assume you have a Spark DataFrame containing information about various products. You want to filter the DataFrame to include only products that belong to a specific list of categories and have a price greater than a certain value.

Creating the DataFrame

First, let’s create a sample DataFrame:


from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("FilterExample").getOrCreate()

# Sample data and schema
data = [
    (1, "Product A", "Electronics", 100),
    (2, "Product B", "Electronics", 200),
    (3, "Product C", "Clothing", 150),
    (4, "Product D", "Clothing", 80),
    (5, "Product E", "Groceries", 60)
]
columns = ["ProductID", "ProductName", "Category", "Price"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Show the DataFrame
df.show()

+---------+-----------+-----------+-----+
|ProductID|ProductName|   Category|Price|
+---------+-----------+-----------+-----+
|        1|  Product A|Electronics|  100|
|        2|  Product B|Electronics|  200|
|        3|  Product C|   Clothing|  150|
|        4|  Product D|   Clothing|   80|
|        5|  Product E|  Groceries|   60|
+---------+-----------+-----------+-----+
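Note that `createDataFrame` infers the column types from the Python data here. If you prefer to be explicit, you can pass a schema instead. The following is a minimal sketch of building the same DataFrame with a `StructType`, using the column names and types from the sample data above:


from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Explicit schema as an alternative to letting Spark infer the types
schema = StructType([
    StructField("ProductID", IntegerType(), False),
    StructField("ProductName", StringType(), False),
    StructField("Category", StringType(), False),
    StructField("Price", IntegerType(), False)
])

df = spark.createDataFrame(data, schema)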

Filtering the DataFrame

Now, let’s filter the DataFrame to include only products that are in the categories ‘Electronics’ or ‘Clothing’ and have a price greater than 100.


# List of categories to filter
categories = ["Electronics", "Clothing"]

filtered_df = df.filter((df.Category.isin(categories)) & (df.Price > 100))

# Show filtered DataFrame
filtered_df.show()

+---------+-----------+-----------+-----+
|ProductID|ProductName|   Category|Price|
+---------+-----------+-----------+-----+
|        2|  Product B|Electronics|  200|
|        3|  Product C|   Clothing|  150|
+---------+-----------+-----------+-----+

Explanation

Here, we use the `filter` method to apply the following conditions:

  • `df.Category.isin(categories)`: This checks if the ‘Category’ column value is in the specified list of categories (`categories`).
  • `df.Price > 100`: This checks if the ‘Price’ column value is greater than 100.

We combine these conditions using the `&` operator, which represents a logical AND.
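If you prefer not to reference columns through the DataFrame object, the same filter can be written with the `col` function and the `where` method, which is an alias of `filter`. A short sketch, using the same `df` and `categories` as above:


from pyspark.sql.functions import col

# Equivalent filter written with where() and col() column references
filtered_df = df.where(col("Category").isin(categories) & (col("Price") > 100))

# Negating isin with ~ keeps only rows whose category is NOT in the list
other_df = df.where(~col("Category").isin(categories))

Each condition must be wrapped in parentheses when combined with `&`, because `&` binds more tightly than comparison operators in Python.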

Example using Scala

Let’s see a similar example using Scala:

Creating the DataFrame

First, let’s create a sample DataFrame:


import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("FilterExample").getOrCreate()

import spark.implicits._

// Sample data and schema
val data = Seq(
    (1, "Product A", "Electronics", 100),
    (2, "Product B", "Electronics", 200),
    (3, "Product C", "Clothing", 150),
    (4, "Product D", "Clothing", 80),
    (5, "Product E", "Groceries", 60)
)
val df = data.toDF("ProductID", "ProductName", "Category", "Price")

// Show the DataFrame
df.show()

+---------+-----------+-----------+-----+
|ProductID|ProductName|   Category|Price|
+---------+-----------+-----------+-----+
|        1|  Product A|Electronics|  100|
|        2|  Product B|Electronics|  200|
|        3|  Product C|   Clothing|  150|
|        4|  Product D|   Clothing|   80|
|        5|  Product E|  Groceries|   60|
+---------+-----------+-----------+-----+

Filtering the DataFrame

Now, let’s filter the DataFrame to include only products that are in the categories ‘Electronics’ or ‘Clothing’ and have a price greater than 100:


// List of categories to filter
val categories = List("Electronics", "Clothing")

val filteredDF = df.filter(df("Category").isin(categories: _*) && df("Price") > 100)

// Show filtered DataFrame
filteredDF.show()

+---------+-----------+-----------+-----+
|ProductID|ProductName|   Category|Price|
+---------+-----------+-----------+-----+
|        2|  Product B|Electronics|  200|
|        3|  Product C|   Clothing|  150|
+---------+-----------+-----------+-----+

Explanation

In Scala, we also use the `filter` method with the following conditions:

  • `df("Category").isin(categories: _*)`: The `isin` method checks if the ‘Category’ column value is in the specified list of categories; `categories: _*` expands the list into the varargs that `isin` expects.
  • `df("Price") > 100`: This checks if the ‘Price’ column value is greater than 100.

We combine these conditions using the `&&` operator, which represents a logical AND.

Conclusion

Filtering a Spark DataFrame based on a list of values with additional criteria can be efficiently achieved using the `filter` or `where` methods. These examples demonstrate how to apply such filters using both PySpark and Scala.
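If you are more comfortable with SQL, the same result can be obtained by registering the DataFrame as a temporary view and querying it. A brief PySpark sketch, assuming the `df` from the example above (the view name `products` is just for illustration):


# Register the DataFrame as a temporary view and filter it with Spark SQL
df.createOrReplaceTempView("products")

sql_filtered_df = spark.sql("""
    SELECT *
    FROM products
    WHERE Category IN ('Electronics', 'Clothing')
      AND Price > 100
""")

sql_filtered_df.show()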
