Conditionally replacing values in a PySpark DataFrame based on another column is a common task in data preprocessing. You can achieve this by using the `when` and `otherwise` functions from the `pyspark.sql.functions` module. Here, I’ll walk you through the process using a practical example.
Example
Let’s consider a DataFrame with two columns: `age` and `category`. You want to create a new column called `age_group` based on conditions applied to the `age` column. Specifically:
– If `age` is less than 18, the `age_group` should be “child”.
– If `age` is between 18 and 65 (inclusive), the `age_group` should be “adult”.
– Otherwise, the `age_group` should be “senior”.
Step-by-Step Implementation
Step 1: Import Necessary Modules
from pyspark.sql import SparkSession
from pyspark.sql.functions import when
The above code imports what we need for this task: SparkSession to create the Spark session, and when to build the conditional replacement expression.
Step 2: Create a Spark Session
spark = SparkSession.builder.appName("Conditional Replace Example").getOrCreate()
The above code creates a Spark session, which is the entry point for interacting with Spark.
Step 3: Create a Sample DataFrame
data = [(10, "some_category"),
        (25, "some_category"),
        (70, "some_category")]
columns = ["age", "category"]
df = spark.createDataFrame(data, schema=columns)
df.show()
This will create a DataFrame with the following data:
+---+-------------+
|age| category|
+---+-------------+
| 10|some_category|
| 25|some_category|
| 70|some_category|
+---+-------------+
Step 4: Apply Conditional Logic
df = df.withColumn(
    "age_group",
    when(df.age < 18, "child")
    .when((df.age >= 18) & (df.age <= 65), "adult")
    .otherwise("senior")
)
df.show()
This code adds a new column, `age_group`, to the DataFrame, applying the conditional logic as described above.
+---+-------------+---------+
|age| category|age_group|
+---+-------------+---------+
| 10|some_category| child|
| 25|some_category| adult|
| 70|some_category| senior|
+---+-------------+---------+
Explanation
The `when` function lets you specify a condition and the value to assign when that condition is true: the first argument is the condition, the second is the value. Chained `when` calls are evaluated in order, and the first matching condition determines the result. The `otherwise` function specifies the value to use when none of the conditions are met; if you omit it, unmatched rows receive null.
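Note that `when` also accepts column expressions built with `col`, so you don’t have to reference the `df` variable directly. The snippet below is an equivalent formulation of Step 4; because chained conditions are checked in order, the second `when` only needs the upper bound.

from pyspark.sql.functions import col, when

df = df.withColumn(
    "age_group",
    when(col("age") < 18, "child")      # first match wins
    .when(col("age") <= 65, "adult")    # ages 18-65; age < 18 was handled above
    .otherwise("senior")
)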
Conclusion
By using the `when` and `otherwise` functions, you can efficiently replace values in a PySpark DataFrame column based on conditions applied to another column. This is a versatile approach suitable for a variety of data preprocessing tasks.
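The same pattern also works when you want to replace values in an existing column rather than add a new one: pass the existing column’s name to `withColumn` and use `otherwise` to keep the current value. Here is a minimal sketch, reusing the `df` from above; the replacement label "minor_category" is purely hypothetical.

from pyspark.sql.functions import col, when

# Overwrite the existing "category" column based on "age":
# rows with age < 18 get the hypothetical label "minor_category",
# all other rows keep their original value.
df = df.withColumn(
    "category",
    when(col("age") < 18, "minor_category").otherwise(col("category"))
)
df.show()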