How Does CASE WHEN Work in Spark SQL?

Let’s take a close look at how the `CASE WHEN` expression works in Spark SQL. This conditional expression is a powerful tool that lets you apply if/else logic directly inside your SQL queries.

Understanding `CASE WHEN` Syntax

The `CASE WHEN` statement in Spark SQL is used to create conditional logic. Here’s the basic syntax:

```sql
SELECT
    CASE
        WHEN condition1 THEN result1
        WHEN condition2 THEN result2
        ...
        ELSE default_result
    END
FROM table_name;
```

Each `WHEN` clause checks a condition; the conditions are evaluated from top to bottom, and as soon as one is true the corresponding `THEN` result is returned. The `ELSE` clause provides a default result when none of the conditions are met; if `ELSE` is omitted, unmatched rows get `NULL`.
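To make that evaluation order concrete, here is a minimal, self-contained sketch; the `scores` view, its values, and the grade thresholds are invented purely for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CaseWhenSyntax").getOrCreate()

# Hypothetical data, used only to illustrate the syntax above
spark.createDataFrame([(95,), (72,), (40,)], ["score"]).createOrReplaceTempView("scores")

# Conditions are tested top to bottom; the first true WHEN wins, so 95 gets 'A'
# even though it also satisfies score >= 70. Rows matching no condition fall
# through to ELSE (or NULL if ELSE were omitted).
spark.sql("""
    SELECT score,
           CASE
               WHEN score >= 90 THEN 'A'
               WHEN score >= 70 THEN 'B'
               ELSE 'C'
           END AS grade
    FROM scores
""").show()
```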

Example in Spark SQL

Let’s look at an example to understand this better. Suppose we have a DataFrame containing information about employees, and we want to categorize them based on their salaries:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("CaseWhenExample").getOrCreate()

# Sample data
data = [
    ("John", 3000),
    ("Doe", 4500),
    ("Jane", 2000),
    ("Smith", 7000)
]

# Creating DataFrame
df = spark.createDataFrame(data, ["Name", "Salary"])

# Using CASE WHEN inside a SQL expression
df_with_category = df.withColumn("Category", expr("""
    CASE
        WHEN Salary < 3000 THEN 'Low'
        WHEN Salary >= 3000 AND Salary <= 5000 THEN 'Medium'
        ELSE 'High'
    END
"""))

df_with_category.show()
```

Running this code produces the following output:


+-----+------+--------+
| Name|Salary|Category|
+-----+------+--------+
| John|  3000|  Medium|
|  Doe|  4500|  Medium|
| Jane|  2000|     Low|
|Smith|  7000|    High|
+-----+------+--------+
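Because `CASE WHEN` is plain SQL, the same categorization can also be run as an actual SQL query by registering the DataFrame as a temporary view. This is a small sketch; the view name `employees` is an arbitrary choice, and `BETWEEN` is inclusive on both ends, so it matches the `>= 3000 AND <= 5000` check used above:

```python
# Register the DataFrame so it can be queried with spark.sql
df.createOrReplaceTempView("employees")

spark.sql("""
    SELECT Name,
           Salary,
           CASE
               WHEN Salary < 3000 THEN 'Low'
               WHEN Salary BETWEEN 3000 AND 5000 THEN 'Medium'
               ELSE 'High'
           END AS Category
    FROM employees
""").show()
```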

Working with the DataFrame API

The same logic can be implemented using Spark’s DataFrame API for those who prefer to avoid raw SQL syntax. The `when` function from `pyspark.sql.functions` can be used for this:

```python
from pyspark.sql.functions import when

df_with_category = df.withColumn(
    "Category",
    when(df.Salary < 3000, "Low")
    .when((df.Salary >= 3000) & (df.Salary <= 5000), "Medium")
    .otherwise("High")
)

df_with_category.show()
```

The output will be the same:


+-----+------+--------+
| Name|Salary|Category|
+-----+------+--------+
| John|  3000|  Medium|
|  Doe|  4500|  Medium|
| Jane|  2000|     Low|
|Smith|  7000|    High|
+-----+------+--------+

Advantages of Using `CASE WHEN`

Using `CASE WHEN` in Spark SQL has several benefits:

  • Readability: Improves the readability of your queries by clearly defining complex conditional logic.
  • Flexibility: Highly versatile for various scenarios, such as complex transformations, aggregations, and filtering.
  • Performance: `CASE WHEN` expressions are handled by Spark’s Catalyst optimizer as part of an ordinary projection, so they typically add no extra pass over the data (a quick way to check the plan is shown below).
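If you want to see what the optimizer does with the expression, one simple check, reusing the `df_with_category` DataFrame built in the examples above, is to print the physical plan:

```python
# The CASE WHEN appears as a single expression inside the Project node
# of the physical plan, rather than as a separate pass over the data.
df_with_category.explain()
```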

Conclusion

The `CASE WHEN` statement in Spark SQL is a robust tool for embedding conditional logic within your queries. It enhances query readability and can be efficiently implemented both in SQL syntax and through the DataFrame API. Understanding and using this feature will significantly improve your ability to perform complex data manipulations in Spark.
