How to Use PySpark withColumn() for Two Conditions and Three Outcomes?

Great question! PySpark’s withColumn() is fundamental to DataFrame transformations: it modifies an existing column or creates a new one. When you have two conditions and three outcomes, you can chain the when() and otherwise() functions from the pyspark.sql.functions module. Let’s dive into an example.
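In outline, the pattern looks like this (a minimal sketch; condition_1, value_1, and the other names are placeholders for your own expressions):

```python
from pyspark.sql.functions import when

# Chain one when() per condition, then otherwise() as the catch-all default.
df = df.withColumn(
    "new_column",
    when(condition_1, value_1)    # checked first
    .when(condition_2, value_2)   # checked only if the first condition is false
    .otherwise(default_value)     # fallback for every remaining row
)
```

Each when() is evaluated in order, and the first condition that matches determines the value.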

Scenario

Suppose you have a DataFrame of employee details and you want to assign a rating based on age and salary. The conditions are:

  • If age > 30 and salary > 5000, set the ‘rating’ to ‘A’.
  • If age <= 30 and salary > 3000, set the ‘rating’ to ‘B’.
  • Otherwise, set the ‘rating’ to ‘C’.

Example in PySpark

Here’s an example of how you can achieve this using PySpark:

### Create a Sample DataFrame


```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

# Initialize the Spark session
spark = SparkSession.builder.appName("withColumnExample").getOrCreate()

# Sample data
data = [
    ("John", 28, 4000),
    ("Anna", 40, 6000),
    ("Mike", 30, 3000),
    ("Sara", 32, 4500)
]

# Column names
columns = ["name", "age", "salary"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)
df.show()
```

```
+----+---+------+
|name|age|salary|
+----+---+------+
|John| 28|  4000|
|Anna| 40|  6000|
|Mike| 30|  3000|
|Sara| 32|  4500|
+----+---+------+
```

### Apply Conditions and Add ‘rating’ Column


```python
# Apply the conditions and add the 'rating' column
df = df.withColumn(
    "rating",
    when((col("age") > 30) & (col("salary") > 5000), "A")
    .when((col("age") <= 30) & (col("salary") > 3000), "B")
    .otherwise("C")
)

df.show()
```

```
+----+---+------+------+
|name|age|salary|rating|
+----+---+------+------+
|John| 28|  4000|     B|
|Anna| 40|  6000|     A|
|Mike| 30|  3000|     C|
|Sara| 32|  4500|     C|
+----+---+------+------+
```

Explanation

In this example:

  • We first create a Spark session and a sample DataFrame.
  • We then use withColumn() to add a new column named rating, building its value with chained when() calls and a final otherwise().
  • The first when() checks whether age is greater than 30 and salary is greater than 5000. If both hold, it assigns ‘A’ to the rating.
  • The second when() checks whether age is less than or equal to 30 and salary is greater than 3000. If both hold, it assigns ‘B’ to the rating.
  • The otherwise() function assigns ‘C’ to every remaining row.

The conditions are evaluated top to bottom, and the first match wins, so order matters when conditions overlap. That is also why Mike ends up with ‘C’: his salary of 3000 is not strictly greater than 3000, so the second condition fails. Sara falls through as well, since her salary of 4500 does not exceed 5000 and her age rules out the second branch.

This demonstrates how to use PySpark’s withColumn() to perform multiple condition checks and assign multiple outcomes to a new column. I hope this helps!
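If you are more comfortable with SQL, the same three-outcome logic can also be written with expr() and a CASE expression. Here is a minimal equivalent sketch:

```python
from pyspark.sql.functions import expr

# The same conditional logic expressed as a Spark SQL CASE expression
df = df.withColumn(
    "rating",
    expr("""
        CASE
            WHEN age > 30 AND salary > 5000 THEN 'A'
            WHEN age <= 30 AND salary > 3000 THEN 'B'
            ELSE 'C'
        END
    """)
)
```

Both forms express the same conditional logic; pick whichever reads more naturally in your codebase.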
