One of the many powerful features of PySpark is its ability to apply conditional logic when manipulating and analyzing data. In this article, we’ll dive into the use of “when” and “otherwise” for conditional logic in PySpark.
Understanding PySpark “when” and “otherwise”
In PySpark, the “when” function is used to evaluate a column’s value against specified conditions. It is very similar to SQL’s “CASE WHEN” expression or an “if-elif-else” chain in Python. You can think of “when” as a way to build a conditional column expression, most commonly used with withColumn to create a new column in a DataFrame based on certain conditions. The “otherwise” method is called on the column expression returned by “when” and plays the role of the “else” branch in an “if-else” statement: it provides a default value for records that do not match any of the specified conditions.
Basic Usage of “when”
Let’s start with a basic example that demonstrates the use of “when” to create a new column that categorizes values from another column.
from pyspark.sql import SparkSession
from pyspark.sql.functions import when
# Initialize a Spark session
spark = SparkSession.builder.appName("whenOtherwiseExample").getOrCreate()
# Create a DataFrame
data = [("James", "Smith", "USA", "CA"),
("Michael", "Rose", "USA", "NY"),
("Robert", "Williams", "USA", "CA"),
("Maria", "Jones", "USA", "FL")]
columns = ["firstname", "lastname", "country", "state"]
df = spark.createDataFrame(data=data, schema=columns)
# Use "when" to create a new column based on a condition
df = df.withColumn("state_category", when(df.state == "CA", "West Coast").otherwise("Other"))
df.show()
The output of the code snippet will be:
+---------+--------+-------+-----+--------------+
|firstname|lastname|country|state|state_category|
+---------+--------+-------+-----+--------------+
|    James|   Smith|    USA|   CA|    West Coast|
|  Michael|    Rose|    USA|   NY|         Other|
|   Robert|Williams|    USA|   CA|    West Coast|
|    Maria|   Jones|    USA|   FL|         Other|
+---------+--------+-------+-----+--------------+
As seen in the output, the “state_category” column is added to the DataFrame, and it categorizes each state as either “West Coast” if the state is “CA” or “Other” if it is not.
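The analogy with SQL’s “CASE WHEN” can be made concrete: the same column can also be produced by passing a CASE expression string to “expr”. A minimal sketch, assuming the DataFrame created above:
from pyspark.sql.functions import expr
# Equivalent logic written as a SQL CASE expression string
df_sql = df.withColumn("state_category",
                       expr("CASE WHEN state = 'CA' THEN 'West Coast' ELSE 'Other' END"))
df_sql.show()
Both forms produce the same result; which one to use is largely a matter of taste and of how much of your logic already lives in SQL.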
Chaining Multiple Conditions
You can chain multiple “when” conditions to build more complex logic. This is analogous to using multiple “if-elif” blocks in Python. Let’s add more conditions to our state categorization.
from pyspark.sql.functions import col
# Chain multiple "when" conditions
df = df.withColumn("state_category",
when(col("state") == "CA", "West Coast")
.when(col("state") == "NY", "East Coast")
.when(col("state") == "FL", "South")
.otherwise("Other"))
df.show()
With the chained “when” conditions, the output will be updated as follows:
+---------+--------+-------+-----+--------------+
|firstname|lastname|country|state|state_category|
+---------+--------+-------+-----+--------------+
|    James|   Smith|    USA|   CA|    West Coast|
|  Michael|    Rose|    USA|   NY|    East Coast|
|   Robert|Williams|    USA|   CA|    West Coast|
|    Maria|   Jones|    USA|   FL|         South|
+---------+--------+-------+-----+--------------+
We can see that we now have additional categories such as “East Coast” for New York and “South” for Florida.
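When the list of categories grows, writing out the chain by hand becomes tedious. The same chained expression can also be built programmatically from an ordinary Python mapping; here is a minimal sketch in which the mapping itself is purely illustrative:
from pyspark.sql.functions import col, when
# Illustrative mapping of state codes to category labels
state_categories = {"CA": "West Coast", "NY": "East Coast", "FL": "South"}
# Start the chain with the first entry, then extend it with .when() for the rest
items = list(state_categories.items())
category_expr = when(col("state") == items[0][0], items[0][1])
for state_code, label in items[1:]:
    category_expr = category_expr.when(col("state") == state_code, label)
df_chained = df.withColumn("state_category", category_expr.otherwise("Other"))
The resulting column is identical to the hand-written chain above.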
Using “when” with Multiple Column Conditions
Conditional logic can also be based on multiple columns at once. For example, we can create a new column that provides information based on both the “state” and “country” columns.
# Use "when" with conditions on multiple columns
df = df.withColumn("region",
when((col("state") == "CA") & (col("country") == "USA"), "USA - West Coast")
.when((col("state") == "NY") & (col("country") == "USA"), "USA - East Coast")
.when((col("state") == "FL") & (col("country") == "USA"), "USA - South")
.otherwise("Other Region"))
df.show()
The output showing the new “region” column would look like this:
+---------+--------+-------+-----+--------------+----------------+
|firstname|lastname|country|state|state_category|          region|
+---------+--------+-------+-----+--------------+----------------+
|    James|   Smith|    USA|   CA|    West Coast|USA - West Coast|
|  Michael|    Rose|    USA|   NY|    East Coast|USA - East Coast|
|   Robert|Williams|    USA|   CA|    West Coast|USA - West Coast|
|    Maria|   Jones|    USA|   FL|         South|     USA - South|
+---------+--------+-------+-----+--------------+----------------+
In this code snippet, we’ve added a new “region” column that combines both the “state” and “country” columns to provide a more specific region description.
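Individual comparisons are combined with PySpark’s column operators: “&” for AND, “|” for OR, and “~” for NOT. Each comparison must be wrapped in its own parentheses because these operators bind more tightly than “==”. As a sketch of an OR-based variant (the coastal/inland grouping is only an illustration):
from pyspark.sql.functions import col, when
# Treat "CA" and "NY" as coastal states; everything else is inland.
# Note the parentheses around each comparison: | binds more tightly than ==.
df_coastal = df.withColumn("coastal",
                           when((col("state") == "CA") | (col("state") == "NY"), "Coastal")
                           .otherwise("Inland"))
df_coastal.show()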
Handling Nulls with “when” and “otherwise”
Null values are a common concern when writing conditional logic, and “when” and “otherwise” handle them predictably. If a column value is null, a comparison such as col("state") == "CA" evaluates to null rather than true, so the row does not match that condition. When no condition matches and no “otherwise” is specified, the result is null; with “otherwise”, such rows receive the default value instead. You can also test for nulls explicitly with isNull().
# Handle nulls with "when" and "otherwise"
df = df.withColumn("contains_a",
when(col("firstname").contains("a"), "Has A")
.otherwise("No A"))
df.show()
This adds a “contains_a” column that checks whether the first name contains a lowercase ‘a’ (the check is case-sensitive):
+---------+--------+-------+-----+--------------+----------------+----------+
|firstname|lastname|country|state|state_category|          region|contains_a|
+---------+--------+-------+-----+--------------+----------------+----------+
|    James|   Smith|    USA|   CA|    West Coast|USA - West Coast|     Has A|
|  Michael|    Rose|    USA|   NY|    East Coast|USA - East Coast|     Has A|
|   Robert|Williams|    USA|   CA|    West Coast|USA - West Coast|      No A|
|    Maria|   Jones|    USA|   FL|         South|     USA - South|     Has A|
+---------+--------+-------+-----+--------------+----------------+----------+
If any of the “firstname” values were null, contains() would return null for that row; the condition would not be satisfied, so the “contains_a” value would fall through to “No A”, as specified by the “otherwise” clause.
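To see this behaviour explicitly, it helps to build a tiny DataFrame that actually contains a null and to catch it with “isNull()” before the other conditions. A minimal sketch with made-up data:
from pyspark.sql.functions import col, when
# A tiny DataFrame containing a null first name, purely for illustration
people = spark.createDataFrame([("James",), ("Maria",), (None,)], "firstname string")
people = people.withColumn("contains_a",
                           when(col("firstname").isNull(), "Unknown")   # catch nulls explicitly first
                           .when(col("firstname").contains("a"), "Has A")
                           .otherwise("No A"))
people.show()
Because conditions are evaluated in order, the null row is labelled “Unknown” before the contains() check ever runs.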
Advanced Uses of “when” and “otherwise”
Beyond the basics, “when” and “otherwise” can be used for more advanced transformations and analyses. These expressions can be nested, used with aggregates, and combined with user-defined functions (UDFs) to enable complex operations on DataFrames.
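One common advanced pattern is conditional aggregation, where a “when” expression is placed inside an aggregate such as “sum” or “count” so that only the rows satisfying a condition are tallied. A minimal sketch against the DataFrame above (the metric names are just illustrations):
from pyspark.sql.functions import col, count, sum as spark_sum, when
# Count the West Coast rows and the total rows per country in a single pass
summary = df.groupBy("country").agg(
    spark_sum(when(col("state") == "CA", 1).otherwise(0)).alias("west_coast_rows"),
    count("*").alias("total_rows"))
summary.show()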
Nesting “when” Expressions
You can nest “when” expressions to handle multiple layers of conditional logic. This is useful when you have a hierarchy of conditions to check.
# Nesting "when" expressions
df = df.withColumn("name_check",
when(col("firstname").startswith("J"),
when(col("lastname").startswith("S"), "J and S")
.otherwise("J but not S"))
.otherwise("Not starting with J"))
df.show()
The output with the nested conditions is as follows:
+---------+--------+-------+-----+--------------+----------------+----------+-------------------+
|firstname|lastname|country|state|state_category|          region|contains_a|         name_check|
+---------+--------+-------+-----+--------------+----------------+----------+-------------------+
|    James|   Smith|    USA|   CA|    West Coast|USA - West Coast|     Has A|            J and S|
|  Michael|    Rose|    USA|   NY|    East Coast|USA - East Coast|     Has A|Not starting with J|
|   Robert|Williams|    USA|   CA|    West Coast|USA - West Coast|      No A|Not starting with J|
|    Maria|   Jones|    USA|   FL|         South|     USA - South|     Has A|Not starting with J|
+---------+--------+-------+-----+--------------+----------------+----------+-------------------+
In this example, we are checking if the first name starts with “J” and then further examining if the last name starts with “S” to create a very specific categorization in the “name_check” column.
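Finally, “when” can be combined with a user-defined function when a condition or a result needs arbitrary Python logic. The sketch below is purely illustrative (the vowel-counting helper is made up, and a built-in expression would normally be faster than a UDF):
from pyspark.sql.functions import col, udf, when
from pyspark.sql.types import IntegerType
# Illustrative UDF: count the vowels in a name
@udf(returnType=IntegerType())
def vowel_count(name):
    return sum(1 for ch in (name or "") if ch.lower() in "aeiou")
df_udf = df.withColumn("vowel_label",
                       when(vowel_count(col("firstname")) >= 3, "Vowel heavy")
                       .otherwise("Vowel light"))
df_udf.show()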
Conclusion
Using PySpark’s “when” and “otherwise” functions can greatly enhance your ability to perform complex conditional data transformations. These expressions can emulate the behavior of traditional conditional statements in SQL and programming languages, providing PySpark users with a familiar and powerful tool for data manipulation. Whether you’re dealing with simple logic or intricate conditions spanning multiple columns and nested expressions, mastering “when” and “otherwise” can improve both the readability and functionality of your Spark data pipelines.
Remember that with the power of Apache Spark, these operations can be performed at scale over large datasets, making them extremely useful for big data processing tasks. Experiment with your own datasets and start integrating conditional logic into your Spark workflows to get the most out of your data analysis.
As you grow more comfortable with Spark and PySpark, you will find that incorporating advanced analytical functions and fine-tuning your data transformations will lead to more insightful analyses and streamlined data processing procedures.