Filtering a Spark DataFrame based on a list of values with additional criteria is a common operation in data processing. We can achieve this by using the `filter` or `where` methods along with logical conditions.
Let’s break down the process, first in PySpark and then in Scala.
Example using PySpark
Scenario
Let’s assume you have a Spark DataFrame containing information about various products. You want to filter the DataFrame to include only products that belong to a specific list of categories and have a price greater than a certain value.
Creating the DataFrame
First, let’s create a sample DataFrame:
from pyspark.sql import SparkSession
# Initialize a Spark session
spark = SparkSession.builder.appName("FilterExample").getOrCreate()
# Sample data and schema
data = [
    (1, "Product A", "Electronics", 100),
    (2, "Product B", "Electronics", 200),
    (3, "Product C", "Clothing", 150),
    (4, "Product D", "Clothing", 80),
    (5, "Product E", "Groceries", 60)
]
columns = ["ProductID", "ProductName", "Category", "Price"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
# Show the DataFrame
df.show()
+---------+-----------+-----------+-----+
|ProductID|ProductName| Category|Price|
+---------+-----------+-----------+-----+
| 1| Product A|Electronics| 100|
| 2| Product B|Electronics| 200|
| 3| Product C| Clothing| 150|
| 4| Product D| Clothing| 80|
| 5| Product E| Groceries| 60|
+---------+-----------+-----------+-----+
Filtering the DataFrame
Now, let’s filter the DataFrame to include only products that are in the categories ‘Electronics’ or ‘Clothing’ and have a price greater than 100.
# List of categories to filter
categories = ["Electronics", "Clothing"]
filtered_df = df.filter((df.Category.isin(categories)) & (df.Price > 100))
# Show filtered DataFrame
filtered_df.show()
+---------+-----------+-----------+-----+
|ProductID|ProductName| Category|Price|
+---------+-----------+-----------+-----+
| 2| Product B|Electronics| 200|
| 3| Product C| Clothing| 150|
+---------+-----------+-----------+-----+
Explanation
Here, we use the `filter` method to apply the following conditions:
- `df.Category.isin(categories)`: This checks if the ‘Category’ column value is in the specified list of categories (`categories`).
- `df.Price > 100`: This checks if the ‘Price’ column value is greater than 100.
We combine these conditions using the `&` operator, which represents a logical AND.
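The same conditions can also be written with the `col` function from `pyspark.sql.functions`, which avoids attribute-style column access and still works when column names contain spaces or dots. A minimal equivalent sketch, assuming the same `df` and `categories` as above:
from pyspark.sql.functions import col
# Equivalent filter using col() instead of attribute access
filtered_df = df.filter(col("Category").isin(categories) & (col("Price") > 100))
filtered_df.show()
Both forms produce the same result; `col()` is simply more robust when a column name is not a valid Python attribute name.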
Example using Scala
Let’s see a similar example using Scala:
Creating the DataFrame
First, let’s create a sample DataFrame:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val spark = SparkSession.builder.appName("FilterExample").getOrCreate()
import spark.implicits._
// Sample data and schema
val data = Seq(
  (1, "Product A", "Electronics", 100),
  (2, "Product B", "Electronics", 200),
  (3, "Product C", "Clothing", 150),
  (4, "Product D", "Clothing", 80),
  (5, "Product E", "Groceries", 60)
)
val df = data.toDF("ProductID", "ProductName", "Category", "Price")
// Show the DataFrame
df.show()
+---------+-----------+-----------+-----+
|ProductID|ProductName| Category|Price|
+---------+-----------+-----------+-----+
| 1| Product A|Electronics| 100|
| 2| Product B|Electronics| 200|
| 3| Product C| Clothing| 150|
| 4| Product D| Clothing| 80|
| 5| Product E| Groceries| 60|
+---------+-----------+-----------+-----+
Filtering the DataFrame
Now, let’s filter the DataFrame to include only products that are in the categories ‘Electronics’ or ‘Clothing’ and have a price greater than 100:
// List of categories to filter
val categories = List("Electronics", "Clothing")
val filteredDF = df.filter(df("Category").isin(categories: _*) && df("Price") > 100)
// Show filtered DataFrame
filteredDF.show()
+---------+-----------+-----------+-----+
|ProductID|ProductName| Category|Price|
+---------+-----------+-----------+-----+
| 2| Product B|Electronics| 200|
| 3| Product C| Clothing| 150|
+---------+-----------+-----------+-----+
Explanation
In Scala, we also use the `filter` method with the following conditions:
- `df("Category").isin(categories: _*)`: The `isin` method checks whether the ‘Category’ column value is in the specified list of categories.
- `df("Price") > 100`: This checks if the ‘Price’ column value is greater than 100.
We combine these conditions using the `&&` operator, which represents a logical AND.
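Since `spark.implicits._` is already imported, the same filter can also be written with the `$` column syntax, which is a little more concise than `df("...")`. A brief sketch under that assumption:
// Equivalent filter using the $ column interpolator
val filteredDF2 = df.filter($"Category".isin(categories: _*) && $"Price" > 100)
filteredDF2.show()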
Conclusion
Filtering a Spark DataFrame based on a list of values with additional criteria can be efficiently achieved using the `filter` or `where` methods. These examples demonstrate how to apply such filters using both PySpark and Scala.
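If you prefer the `where` method mentioned above, the whole condition can also be passed as a single SQL expression string. A minimal PySpark sketch, assuming the same `df` as in the first example:
# Same filter written as a SQL expression string; IN and AND behave like their SQL counterparts
filtered_df = df.where("Category IN ('Electronics', 'Clothing') AND Price > 100")
filtered_df.show()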