Replace Empty Values in a PySpark DataFrame
In this guide, we’ll explore how to replace empty values across different data types in a PySpark DataFrame.
Understanding PySpark DataFrames
Before we dive into replacing empty values, it’s important to understand what PySpark DataFrames are. In simple terms, a DataFrame is a distributed collection of data organized into named columns, similar to a table in a relational database or a data frame in R or Python (Pandas). However, unlike Pandas DataFrames, PySpark DataFrames are designed to be distributed across a computing cluster, making them highly scalable and efficient for big data processing tasks.
Identifying Empty Values in a DataFrame
Empty values can be present in a DataFrame in various forms, such as `None`, `NaN` (Not a Number), or an empty string (`""`). The approach to handle these values differs based on the situation and the type of data.
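To make the distinction concrete, here is a minimal sketch (the column names are illustrative) showing how each form appears when displayed:
from pyspark.sql import SparkSession
# Create a Spark session for the demo
spark = SparkSession.builder.appName("EmptyKinds").getOrCreate()
# 'None' becomes a SQL null, float("nan") stays NaN, and "" is simply
# an ordinary string that happens to be empty
demo = spark.createDataFrame([("a", None), ("", float("nan"))],
                             ["text_col", "num_col"])
demo.show()
# +--------+-------+
# |text_col|num_col|
# +--------+-------+
# |       a|   null|
# |        |    NaN|
# +--------+-------+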
Detecting Empty Values
Before replacing empty values, we need to detect them. PySpark provides the Column methods `isNull()` and `isNotNull()`, along with the `isnan()` function from `pyspark.sql.functions`, to identify these values.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, isnan, when
# Create a Spark session
spark = SparkSession.builder.appName("ReplaceEmptyValues").getOrCreate()
# Sample DataFrame with empty values
data = [("James", "", "Smith", "36636", "M", 3000),
("Michael", "Rose", "", "40288", "M", 4000),
("Robert", "", "Williams", "42114", "", 4000),
("Maria", "Anne", "Jones", "39192", "F", None),
(None, "Julia", "", "", "F", 3000)]
columns = ["firstname", "middlename", "lastname", "dob", "gender", "salary"]
# Create a DataFrame
df = spark.createDataFrame(data, columns)
# Detect None or Empty values
df.withColumn("isEmpty", when(col("middlename") == "", True).otherwise(False)).show()
This snippet will output something like:
+---------+----------+---------+-----+------+------+-------+
|firstname|middlename|lastname | dob |gender|salary|isEmpty|
+---------+----------+---------+-----+------+------+-------+
| James | |Smith |... |M |3000 |true |
| Michael |Rose | |... |M |4000 |false |
| Robert | |Williams |... | |4000 |true |
| Maria |Anne |Jones |... |F |null |false |
| null |Julia | |... |F |3000 |false |
+---------+----------+---------+-----+------+------+-------+
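The snippet above only flags empty strings in a single column. To take stock of every kind of empty value at once, combine `isNull()` with an equality check against `""` for string columns, and with `isnan()` for numeric ones. A minimal sketch reusing the `df` defined above (note that `isnan()` operates on floating-point data, so the integer `salary` column is cast first):
from pyspark.sql.functions import col, count, isnan, when
# Count null or empty-string values in each string column
string_cols = ["firstname", "middlename", "lastname", "dob", "gender"]
df.select([count(when(col(c).isNull() | (col(c) == ""), c)).alias(c)
           for c in string_cols]).show()
# Count null or NaN values in the numeric column
df.select(count(when(col("salary").isNull() |
                     isnan(col("salary").cast("double")), "salary"))
          .alias("salary")).show()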
Replacing Empty Values
Now that we can identify empty values, we can work on replacing them. PySpark provides `fillna()` and `na.fill()` (two names for the same operation) to replace null/None or NaN values with a specified value.
Replacing Null Values
To replace null values, we can use the `fillna()` function. It accepts a value and an optional subset of columns; if a subset is not specified, it replaces nulls across all columns of a compatible type with the provided value.
# Replace 'None' values in 'salary' column with a specific value
df.na.fill({"salary": 0}).show()
# Replace 'None' values across all columns with the same value
df.fillna("unknown").show()
The first example will output something like this (note that the earlier `withColumn()` call did not modify `df`, so the `isEmpty` column does not appear):
+---------+----------+---------+-----+------+------+
|firstname|middlename|lastname | dob |gender|salary|
+---------+----------+---------+-----+------+------+
| James   |          |Smith    |...  |M     |3000  |
| Michael |Rose      |         |...  |M     |4000  |
| Robert  |          |Williams |...  |      |4000  |
| Maria   |Anne      |Jones    |...  |F     |0     |
| null    |Julia     |         |...  |F     |3000  |
+---------+----------+---------+-----+------+------+
And the second example will replace `None` with “unknown” in the string columns only; empty strings are left as they are, because `fillna()` matches nulls rather than empty strings, and the numeric `salary` column is skipped since the fill value is a string:
+---------+----------+---------+-----+------+------+
|firstname|middlename|lastname | dob |gender|salary|
+---------+----------+---------+-----+------+------+
| James   |          |Smith    |36636|M     |3000  |
| Michael |Rose      |         |40288|M     |4000  |
| Robert  |          |Williams |42114|      |4000  |
| Maria   |Anne      |Jones    |39192|F     |null  |
| unknown |Julia     |         |     |F     |3000  |
+---------+----------+---------+-----+------+------+
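`fillna()` also accepts a `subset` parameter to restrict the fill to specific columns, and for floating-point columns the same call covers `NaN` values too. A short sketch against the same `df`:
# Replace nulls only in the 'salary' column
df.fillna(0, subset=["salary"]).show()
# A dictionary fills several columns of different types in one pass
df.fillna({"salary": 0, "firstname": "unknown"}).show()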
Replacing Empty Strings
Replacing empty strings is slightly different, since `fillna()` and `na.fill()` only target nulls and NaNs and will not touch an empty string. Instead, we can use the `when()` and `otherwise()` functions (or `replace()`, shown further below) to achieve the desired replacements.
df.withColumn("middlename", when(col("middlename") == "", "No Middle Name").otherwise(col("middlename"))).show()
This will replace all empty middle names with the string “No Middle Name”:
+---------+--------------+---------+-----+------+-------+
|firstname|middlename |lastname | dob |gender|salary |
+---------+--------------+---------+-----+------+-------+
| James |No Middle Name|Smith |... |M |3000 |
| Michael |Rose | |... |M |4000 |
| Robert |No Middle Name|Williams |... | |4000 |
| Maria |Anne |Jones |... |F |null |
| null |Julia | |... |F |3000 |
+---------+--------------+---------+-----+------+-------+
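An alternative, sketched here on the same `df`, is `DataFrame.replace()` (also reachable as `df.na.replace()`), which swaps exact values, including empty strings:
# Replace empty strings in 'middlename' with a placeholder
df.replace("", "No Middle Name", subset=["middlename"]).show()
# Or convert empty strings to null first, then let fillna() handle them
df.replace("", None, subset=["middlename"]) \
  .fillna("No Middle Name", subset=["middlename"]).show()
Converting empty strings to null first is handy when you want a single `fillna()` call to cover both nulls and empty strings.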
Conclusion
Dealing with empty values is a common and essential part of pre-processing data in PySpark. As we have seen, PySpark provides versatile functions to detect and replace null values, NaN values, and empty strings in DataFrames. Remember that different situations call for different handling of these empty values, so always be sure to understand the context of your data and the implications of replacing these values before performing any operations. With the techniques outlined above, you should be well-equipped to handle most scenarios involving empty values in PySpark DataFrames.
Lastly, don’t forget to stop the Spark session after completing your tasks to release the resources:
# Stop the Spark session
spark.stop()
We hope this comprehensive guide has been helpful in your PySpark data processing journey. Happy coding!