Filtering rows in a DataFrame based on multiple conditions is a common operation in Spark. You can achieve this by passing a combined condition to the `filter` or `where` method, joining the individual conditions with logical operators: `&` (and), `|` (or), and `~` (not) in PySpark, or `&&`, `||`, and `!` in Scala. Below, I’ll demonstrate this in both PySpark (Python) and Scala.
PySpark (Python)
Here’s an example of how to apply multiple conditions using PySpark:
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName("filter_example").getOrCreate()
# Sample data
data = [
("Alice", 28, "F"),
("Bob", 35, "M"),
("Carol", 29, "F"),
("Dave", 45, "M")
]
# Create DataFrame
columns = ["Name", "Age", "Gender"]
df = spark.createDataFrame(data, columns)
# Apply multiple conditions; each comparison needs its own parentheses
# because & binds more tightly than > and == in Python
filtered_df = df.filter((df.Age > 30) & (df.Gender == "M"))
# Show results
filtered_df.show()
+----+---+------+
|Name|Age|Gender|
+----+---+------+
| Bob| 35|     M|
|Dave| 45|     M|
+----+---+------+
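The PySpark example above wraps each comparison in parentheses, and that is not just a matter of style. The following is a small sketch in plain Python (no Spark required) that uses the standard `ast` module to show how Python groups the expression with and without parentheses. The names `age` and `gender` are placeholders for illustration only; the expressions are parsed, never evaluated.

```python
import ast

# Without parentheses, & binds more tightly than > and ==,
# so Python groups `30 & gender` first and chains the comparisons.
unparenthesized = ast.parse('age > 30 & gender == "M"', mode="eval").body

# With parentheses, & combines the two finished comparisons.
parenthesized = ast.parse('(age > 30) & (gender == "M")', mode="eval").body

# Unparenthesized: a chained comparison, age > (30 & gender) == "M"
print(type(unparenthesized).__name__)                 # Compare
print(type(unparenthesized.comparators[0]).__name__)  # BinOp

# Parenthesized: a single & whose operands are the two comparisons
print(type(parenthesized).__name__)        # BinOp
print(type(parenthesized.left).__name__)   # Compare
print(type(parenthesized.right).__name__)  # Compare
```

With real DataFrame columns, the unparenthesized form typically fails with an error rather than building the intended filter, so it is safest to parenthesize every comparison you combine with `&` or `|`.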
Scala
Similarly, you can apply multiple conditions using Scala:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val spark = SparkSession.builder.appName("filter_example").getOrCreate()
// Sample data
val data = Seq(
("Alice", 28, "F"),
("Bob", 35, "M"),
("Carol", 29, "F"),
("Dave", 45, "M")
)
// Create DataFrame (spark.implicits._ provides toDF and the $"column" syntax)
val columns = Seq("Name", "Age", "Gender")
import spark.implicits._
val df = data.toDF(columns: _*)
// Apply multiple conditions
val filteredDF = df.filter($"Age" > 30 && $"Gender" === "M")
// Show results
filteredDF.show()
+----+---+------+
|Name|Age|Gender|
+----+---+------+
| Bob| 35|     M|
|Dave| 45|     M|
+----+---+------+
Explanation
In both PySpark and Scala examples:
- The DataFrame is initialized with some sample data.
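- The conditions are combined with the `&` operator in PySpark and the `&&` operator in Scala, both of which mean logical AND. In PySpark, each comparison must be wrapped in parentheses because `&` binds more tightly than `>` and `==`; in Scala, column equality is written `===` rather than `==`.
- The `filter` method (for which `where` is an alias) keeps only the rows that meet both conditions.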
- The `show` method is called to display the filtered rows.
You can also use the other logical operators to build different combinations of conditions: `|` for logical OR and `~` for logical NOT in PySpark, or `||` and `!` in Scala.