How to Apply Multiple Conditions for Filtering in Spark DataFrames?

Filtering rows on multiple conditions is a common operation in Spark DataFrames. You combine the conditions with logical operators and pass the result to the `filter` or `where` method: PySpark uses `&` (and), `|` (or), and `~` (not), while Scala uses `&&`, `||`, and `!`. The examples below show this in both PySpark (Python) and Scala.

PySpark (Python)

Here’s an example of how to apply multiple conditions using PySpark:


from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("filter_example").getOrCreate()

# Sample data
data = [
    ("Alice", 28, "F"),
    ("Bob", 35, "M"),
    ("Carol", 29, "F"),
    ("Dave", 45, "M")
]

# Create DataFrame
columns = ["Name", "Age", "Gender"]
df = spark.createDataFrame(data, columns)

# Apply multiple conditions (each comparison needs its own parentheses)
filtered_df = df.filter((df.Age > 30) & (df.Gender == "M"))

# Show results
filtered_df.show()

Output:

+----+---+------+
|Name|Age|Gender|
+----+---+------+
| Bob| 35|     M|
|Dave| 45|     M|
+----+---+------+
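
Note that each comparison in the PySpark condition is wrapped in parentheses. In Python, `&` binds more tightly than `>` and `==`, so the same expression without parentheses is grouped differently from what was intended. The sketch below uses only Python's standard `ast` module (no Spark required) to show how Python parses the two forms. The names `age` and `gender` are placeholders chosen for illustration, and `top_level_node` is a hypothetical helper, not a Spark or PySpark function.

```python
import ast


def top_level_node(expression: str) -> str:
    """Return the class name of the outermost AST node of a Python expression."""
    return type(ast.parse(expression, mode="eval").body).__name__


# With parentheses: the outermost operation is `&`, which combines
# two complete comparisons.
with_parens = top_level_node("(age > 30) & (gender == 'M')")

# Without parentheses: `&` is grouped first, so Python reads this as the
# chained comparison  age > (30 & gender) == 'M'.
without_parens = top_level_node("age > 30 & gender == 'M'")

print(with_parens)     # BinOp
print(without_parens)  # Compare
```

With real PySpark columns, the form without parentheses typically fails with an error instead of filtering as intended, so wrapping every comparison is the safe habit.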

Scala

Similarly, you can apply multiple conditions using Scala:


import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("filter_example").getOrCreate()

// Sample data
val data = Seq(
  ("Alice", 28, "F"),
  ("Bob", 35, "M"),
  ("Carol", 29, "F"),
  ("Dave", 45, "M")
)

// Create DataFrame (spark.implicits._ enables toDF and the $"column" syntax)
import spark.implicits._
val columns = Seq("Name", "Age", "Gender")
val df = data.toDF(columns: _*)

// Apply multiple conditions
val filteredDF = df.filter($"Age" > 30 && $"Gender" === "M")

// Show results
filteredDF.show()

Output:

+----+---+------+
|Name|Age|Gender|
+----+---+------+
| Bob| 35|     M|
|Dave| 45|     M|
+----+---+------+

Explanation

In both the PySpark and Scala examples:

  • The DataFrame is created from a small set of sample data.
  • The two conditions are combined with logical AND: the `&` operator in PySpark and the `&&` operator in Scala.
  • The `filter` method (`where` is an alias for it) keeps only the rows that satisfy both conditions.
  • The `show` method displays the filtered rows.

You can build other combinations of conditions with logical OR (`|` in PySpark, `||` in Scala) and logical NOT (`~` in PySpark, `!` in Scala).
