Replacing a string value with `null` in PySpark can be achieved using a combination of the `withColumn` method and the `when` and `otherwise` functions from the `pyspark.sql.functions` module. Below, I’ll show you an example where we replace the string value `”UNKNOWN”` with `null` in a DataFrame.
Example
Let’s say we have a DataFrame with some sample data:
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col
# Create a Spark session
spark = SparkSession.builder.appName("ReplaceStringWithNull").getOrCreate()
# Sample data
data = [
(1, "John Doe", "UNKNOWN"),
(2, "Jane Smith", "New York"),
(3, "Sam Brown", "UNKNOWN"),
]
# Create DataFrame
df = spark.createDataFrame(data, ["id", "name", "city"])
# Show original DataFrame
df.show()
+---+----------+--------+
| id| name| city|
+---+----------+--------+
| 1| John Doe| UNKNOWN|
| 2|Jane Smith|New York|
| 3| Sam Brown| UNKNOWN|
+---+----------+--------+
Now, we will replace all occurrences of the string `”UNKNOWN”` in the `city` column with `null`:
# Replace "UNKNOWN" with null
df_with_null = df.withColumn("city", when(col("city") == "UNKNOWN", None).otherwise(col("city")))
# Show updated DataFrame
df_with_null.show()
+---+----------+--------+
| id| name| city|
+---+----------+--------+
| 1| John Doe| null|
| 2|Jane Smith|New York|
| 3| Sam Brown| null|
+---+----------+--------+
In this example:
- We create a sample DataFrame with columns `id`, `name`, and `city`.
- We then use the `withColumn` method to create a new DataFrame where the `city` column is modified.
- The `when` function is used to specify that if the value of `city` is `”UNKNOWN”`, it should be replaced with `null`. Otherwise, the original value is preserved using the `otherwise` function.
You can utilize a similar approach for other columns or different string values as needed.