How to Use Multiple Conditions in PySpark’s When Clause?

If you’re working with PySpark and need to apply conditional logic that involves multiple conditions, you can use the `when` function together with the `&` (AND) and `|` (OR) operators to combine them. The `when` function is part of PySpark’s `pyspark.sql.functions` module, and it’s typically used in conjunction with the `withColumn` method to create a new column based on these conditions.
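For instance, combining both operators in a single condition might look like the following sketch. Each comparison must be wrapped in parentheses, because `&` and `|` bind more tightly than the comparison operators in Python; the `department` column here is purely hypothetical and only serves to illustrate the syntax.


from pyspark.sql.functions import col, when

# Sketch: assumes a DataFrame `df` with `salary` and `department` columns.
# Each comparison is parenthesized because & and | have higher precedence
# than <, >=, and == in Python.
df = df.withColumn(
    "review_flag",
    when(
        ((col("salary") >= 1500) & (col("salary") < 2000))
        | (col("department") == "Sales"),
        "Review"
    ).otherwise("OK")
)
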

Implementing Multiple Conditions using When Clause

Let’s go through an example where we use multiple conditions in the `when` clause. We’ll start by creating a DataFrame and then use multiple conditions to classify rows into categories.

Example Code


from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("Example App") \
    .getOrCreate()

# Create a DataFrame
data = [
    (1, 1000),
    (2, 1500),
    (3, 2000),
    (4, 1200)
]
columns = ["id", "salary"]

df = spark.createDataFrame(data, columns)

# Apply multiple conditions using when
df = df.withColumn(
    "salary_category",
    when((col("salary") < 1100), "Low")
    .when((col("salary") >= 1100) & (col("salary") < 1500), "Medium")
    .when((col("salary") >= 1500) & (col("salary") < 2000), "High")
    .otherwise("Very High")
)

df.show()

Output


+---+------+---------------+
| id|salary|salary_category|
+---+------+---------------+
|  1|  1000|            Low|
|  2|  1500|           High|
|  3|  2000|      Very High|
|  4|  1200|         Medium|
+---+------+---------------+

Explanation

  • Initialization: First, we initialize a Spark session, which is the entry point to use PySpark.
  • Creation of DataFrame: We create a DataFrame with columns `id` and `salary` using some example data.
  • Application of Conditions: We use the `withColumn` method to add a new column `salary_category` to the DataFrame. Within the `withColumn` method, we use the `when` function multiple times to evaluate different conditions on the `salary` column.
    • If the salary is less than 1100, it is categorized as “Low”.
    • If the salary is at least 1100 but less than 1500, it is categorized as “Medium”.
    • If the salary is at least 1500 but less than 2000, it is categorized as “High”.
    • Otherwise (a salary of 2000 or more), it is categorized as “Very High”.
  • Show the DataFrame: Finally, we use the `show` method to display the DataFrame, which now includes the new `salary_category` column based on the multiple conditions specified.

This approach demonstrates how you can use multiple conditions in PySpark’s `when` clause to perform intricate data transformations efficiently.
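
If you prefer SQL syntax, the same classification can also be written as a `CASE` expression using `expr`; this is only a sketch, assuming the same `df` and `salary` column from the example above.


from pyspark.sql.functions import expr

# Equivalent logic expressed as a SQL CASE expression instead of chained when() calls.
df = df.withColumn(
    "salary_category",
    expr("""
        CASE
            WHEN salary < 1100 THEN 'Low'
            WHEN salary >= 1100 AND salary < 1500 THEN 'Medium'
            WHEN salary >= 1500 AND salary < 2000 THEN 'High'
            ELSE 'Very High'
        END
    """)
)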
