Great question! PySpark's `withColumn()` is fundamental for data transformation in DataFrame operations. Often, one needs to apply conditions to modify or create new columns. If you have two conditions and three outcomes, you can chain the `when()` and `otherwise()` functions from PySpark's `pyspark.sql.functions` module. Let's dive into an example.
## Scenario
Suppose you have a DataFrame of employees' details and you want to assign a rating based on their age and salary. The conditions are:

- If age > 30 and salary > 5000, set the `rating` to 'A'.
- If age <= 30 and salary > 3000, set the `rating` to 'B'.
- Otherwise, set the `rating` to 'C'.
## Example in PySpark
Here’s an example of how you can achieve this using PySpark:
### Create a Sample DataFrame
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

# Initialize Spark session
spark = SparkSession.builder.appName("withColumnExample").getOrCreate()

# Sample data
data = [
    ("John", 28, 4000),
    ("Anna", 40, 6000),
    ("Mike", 30, 3000),
    ("Sara", 32, 4500)
]

# Columns
columns = ["name", "age", "salary"]

# Create DataFrame
df = spark.createDataFrame(data, columns)
df.show()
```

Output:

```
+----+---+------+
|name|age|salary|
+----+---+------+
|John| 28|  4000|
|Anna| 40|  6000|
|Mike| 30|  3000|
|Sara| 32|  4500|
+----+---+------+
```
### Apply Conditions and Add the 'rating' Column
```python
# Applying conditions and adding the 'rating' column
df = df.withColumn(
    "rating",
    when((col("age") > 30) & (col("salary") > 5000), "A")
    .when((col("age") <= 30) & (col("salary") > 3000), "B")
    .otherwise("C")
)
df.show()
```

Output:

```
+----+---+------+------+
|name|age|salary|rating|
+----+---+------+------+
|John| 28|  4000|     B|
|Anna| 40|  6000|     A|
|Mike| 30|  3000|     C|
|Sara| 32|  4500|     C|
+----+---+------+------+
```
## Explanation
In this example:

- We first create a Spark session and a sample DataFrame.
- We then use `withColumn()` to add a new column named `rating`, applying the conditions with the `when()` and `otherwise()` functions.
- The first `when()` checks whether age is greater than 30 and salary is greater than 5000. If true, it assigns 'A' to the rating.
- The second `when()` checks whether age is less than or equal to 30 and salary is greater than 3000. If true, it assigns 'B' to the rating.
- The `otherwise()` call assigns 'C' in all other cases.
This demonstrates how to use PySpark's `withColumn()` to perform multiple condition checks and assign multiple outcomes to a new column. I hope this helps!