How Do I Replace a String Value with Null in PySpark?

Replacing a string value with `null` in PySpark can be achieved using a combination of the `withColumn` method and the `when` and `otherwise` functions from the `pyspark.sql.functions` module. Below, I’ll show you an example where we replace the string value `”UNKNOWN”` with `null` in a DataFrame.

Example

Let’s say we have a DataFrame with some sample data:


from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

# Create a Spark session
spark = SparkSession.builder.appName("ReplaceStringWithNull").getOrCreate()

# Sample data
data = [
    (1, "John Doe", "UNKNOWN"),
    (2, "Jane Smith", "New York"),
    (3, "Sam Brown", "UNKNOWN"),
]

# Create DataFrame
df = spark.createDataFrame(data, ["id", "name", "city"])

# Show original DataFrame
df.show()

+---+----------+--------+
| id|      name|    city|
+---+----------+--------+
|  1|  John Doe| UNKNOWN|
|  2|Jane Smith|New York|
|  3| Sam Brown| UNKNOWN|
+---+----------+--------+

Now, we will replace all occurrences of the string `”UNKNOWN”` in the `city` column with `null`:


# Replace "UNKNOWN" with null
df_with_null = df.withColumn("city", when(col("city") == "UNKNOWN", None).otherwise(col("city")))

# Show updated DataFrame
df_with_null.show()

+---+----------+--------+
| id|      name|    city|
+---+----------+--------+
|  1|  John Doe|    null|
|  2|Jane Smith|New York|
|  3| Sam Brown|    null|
+---+----------+--------+

In this example:

  • We create a sample DataFrame with columns `id`, `name`, and `city`.
  • We then use the `withColumn` method to create a new DataFrame where the `city` column is modified.
  • The `when` function is used to specify that if the value of `city` is `”UNKNOWN”`, it should be replaced with `null`. Otherwise, the original value is preserved using the `otherwise` function.

You can utilize a similar approach for other columns or different string values as needed.

About Editorial Team

Our Editorial Team is made up of tech enthusiasts who are highly skilled in Apache Spark, PySpark, and Machine Learning. They are also proficient in Python, Pandas, R, Hive, PostgreSQL, Snowflake, and Databricks. They aren't just experts; they are passionate teachers. They are dedicated to making complex data concepts easy to understand through engaging and simple tutorials with examples.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top