If you’re working with PySpark and need to implement conditional logic with multiple conditions, you can use the `when` function together with the `&` (AND) and `|` (OR) operators to combine conditions. The `when` function is part of PySpark’s `pyspark.sql.functions` module, and it’s typically used in conjunction with the `withColumn` method to create a new column based on these conditions.
Implementing Multiple Conditions using When Clause
Let’s go through an example where we use multiple conditions in the `when` clause. We’ll start by creating a DataFrame and then use multiple conditions to classify rows into categories.
Example Code
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("Example App") \
    .getOrCreate()

# Create a DataFrame
data = [
    (1, 1000),
    (2, 1500),
    (3, 2000),
    (4, 1200)
]
columns = ["id", "salary"]
df = spark.createDataFrame(data, columns)

# Apply multiple conditions using when
df = df.withColumn(
    "salary_category",
    when(col("salary") < 1100, "Low")
    .when((col("salary") >= 1100) & (col("salary") < 1500), "Medium")
    .when((col("salary") >= 1500) & (col("salary") < 2000), "High")
    .otherwise("Very High")
)

df.show()
Output
+---+------+---------------+
| id|salary|salary_category|
+---+------+---------------+
|  1|  1000|            Low|
|  2|  1500|           High|
|  3|  2000|      Very High|
|  4|  1200|         Medium|
+---+------+---------------+
Explanation
- Initialization: First, we initialize a Spark session, which is the entry point to use PySpark.
- Creation of DataFrame: We create a DataFrame with columns `id` and `salary` using some example data.
- Application of Conditions: We use the `withColumn` method to add a new column `salary_category` to the DataFrame. Within the `withColumn` method, we chain the `when` function multiple times to evaluate different conditions on the `salary` column (an equivalent SQL-style formulation is sketched after this list).
- If the salary is less than 1100, it is categorized as “Low”.
- If the salary is at least 1100 but less than 1500, it is categorized as “Medium”.
- If the salary is at least 1500 but less than 2000, it is categorized as “High”.
- Otherwise (a salary of 2000 or more), it is categorized as “Very High”.
- Show the DataFrame: Finally, we use the `show` method to display the DataFrame, which now includes the new `salary_category` column based on the multiple conditions specified.
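The same classification can also be written with Spark SQL’s `CASE WHEN` syntax through the `expr` function. The sketch below is just an alternative formulation of the example above, assuming the same `df` with a `salary` column; it is not part of the original code.

from pyspark.sql.functions import expr

# Equivalent classification expressed as a SQL CASE WHEN string
df = df.withColumn(
    "salary_category",
    expr("""
        CASE
            WHEN salary < 1100 THEN 'Low'
            WHEN salary >= 1100 AND salary < 1500 THEN 'Medium'
            WHEN salary >= 1500 AND salary < 2000 THEN 'High'
            ELSE 'Very High'
        END
    """)
)

Both versions produce the same `salary_category` column; the `when` chain keeps the logic in Python, while `expr` can be convenient if the rules already exist as SQL.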
This approach demonstrates how you can use multiple conditions in PySpark’s `when` clause to perform intricate data transformations efficiently.
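The example above only uses the `&` (AND) operator. As a quick illustration of the `|` (OR) operator mentioned at the start, here is a minimal sketch that flags salaries at either extreme; the `salary_flag` column name and the thresholds are made up for this illustration.

from pyspark.sql.functions import col, when

# Combine conditions with | (OR); each comparison must be wrapped in
# parentheses because | binds more tightly than the comparison operators
df = df.withColumn(
    "salary_flag",
    when((col("salary") < 1100) | (col("salary") >= 2000), "Outlier")
    .otherwise("Typical")
)
df.show()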