Conditionally replacing values in a PySpark DataFrame based on another column is a common task in data preprocessing. You can achieve this by using the `when` and `otherwise` functions from the `pyspark.sql.functions` module. Here, I’ll walk you through the process using a practical example.
Example
Let’s consider a DataFrame with two columns: `age` and `category`. You want to create a new column called `age_group` based on conditions applied to the `age` column. Specifically:
– If `age` is less than 18, the `age_group` should be “child”.
– If `age` is between 18 and 65 (inclusive), the `age_group` should be “adult”.
– Otherwise, the `age_group` should be “senior”.
Step-by-Step Implementation
Step 1: Import Necessary Modules
from pyspark.sql import SparkSession
from pyspark.sql.functions import when
The above code imports what we need for this task: SparkSession to create the Spark session, and when to build the conditional replacement expression.
Step 2: Create a Spark Session
spark = SparkSession.builder.appName("Conditional Replace Example").getOrCreate()
The above code creates a Spark session, which is the entry point for interacting with Spark.
Step 3: Create a Sample DataFrame
data = [(10, "some_category"),
        (25, "some_category"),
        (70, "some_category")]
columns = ["age", "category"]
df = spark.createDataFrame(data, schema=columns)
df.show()
This will create a DataFrame with the following data:
+---+-------------+
|age| category|
+---+-------------+
| 10|some_category|
| 25|some_category|
| 70|some_category|
+---+-------------+
Step 4: Apply Conditional Logic
df = df.withColumn(
    "age_group",
    when(df.age < 18, "child")
    .when((df.age >= 18) & (df.age <= 65), "adult")
    .otherwise("senior")
)
df.show()
This code adds a new column, `age_group`, to the DataFrame, applying the conditional logic as described above.
+---+-------------+---------+
|age| category|age_group|
+---+-------------+---------+
| 10|some_category| child|
| 25|some_category| adult|
| 70|some_category| senior|
+---+-------------+---------+
Explanation
The `when` function lets you specify a condition and the value to assign when that condition is true: the first argument is the condition, the second is the value. Chained `when` calls are evaluated in order, and the first matching condition determines the result. The `otherwise` function specifies the value to use when none of the conditions are met; if you omit it, unmatched rows receive null.
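Note that `when` also accepts column expressions built with `col`, so you don’t have to reference the `df` variable directly. The snippet below is an equivalent formulation of Step 4; because chained conditions are checked in order, the second `when` only needs the upper bound.

from pyspark.sql.functions import col, when

df = df.withColumn(
    "age_group",
    when(col("age") < 18, "child")      # first match wins
    .when(col("age") <= 65, "adult")    # ages 18-65; age < 18 was handled above
    .otherwise("senior")
)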
Conclusion
By using the `when` and `otherwise` functions, you can efficiently replace values in a PySpark DataFrame column based on conditions applied to another column. This is a versatile approach suitable for a variety of data preprocessing tasks.
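The same pattern also works when you want to replace values in an existing column rather than add a new one: pass the existing column’s name to `withColumn` and use `otherwise` to keep the current value. Here is a minimal sketch, reusing the `df` from above; the replacement label "minor_category" is purely hypothetical.

from pyspark.sql.functions import col, when

# Overwrite the existing "category" column based on "age":
# rows with age < 18 get the hypothetical label "minor_category",
# all other rows keep their original value.
df = df.withColumn(
    "category",
    when(col("age") < 18, "minor_category").otherwise(col("category"))
)
df.show()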