Replace Empty Values in a PySpark DataFrame
In this guide, we’ll explore how to replace empty values across different data types in a PySpark DataFrame.
Understanding PySpark DataFrames
Before we dive into replacing empty values, it’s important to understand what PySpark DataFrames are. In simple terms, a DataFrame is a distributed collection of data organized into named columns, similar to a table in a relational database or a data frame in R or Python (Pandas). However, unlike Pandas DataFrames, PySpark DataFrames are designed to be distributed across a computing cluster, making them highly scalable and efficient for big data processing tasks.
Identifying Empty Values in a DataFrame
Empty values can be present in a DataFrame in various forms, such as `None`, `NaN` (Not a Number), or an empty string (`""`). The approach to handle these values differs based on the situation and the type of data.
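To make the distinction concrete, here is a minimal sketch (the column names are illustrative) showing how each form appears when displayed:
from pyspark.sql import SparkSession
# Create a Spark session for the demo
spark = SparkSession.builder.appName("EmptyKinds").getOrCreate()
# 'None' becomes a SQL null, float("nan") stays NaN, and "" is simply
# an ordinary string that happens to be empty
demo = spark.createDataFrame([("a", None), ("", float("nan"))],
                             ["text_col", "num_col"])
demo.show()
# +--------+-------+
# |text_col|num_col|
# +--------+-------+
# |       a|   null|
# |        |    NaN|
# +--------+-------+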
Detecting Empty Values
Before replacing empty values, we need to detect them. PySpark provides the Column methods `isNull()` and `isNotNull()`, along with the `isnan()` function from `pyspark.sql.functions`, to identify these values.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, isnan, when
# Create a Spark session
spark = SparkSession.builder.appName("ReplaceEmptyValues").getOrCreate()
# Sample DataFrame with empty values
data = [("James", "", "Smith", "36636", "M", 3000),
("Michael", "Rose", "", "40288", "M", 4000),
("Robert", "", "Williams", "42114", "", 4000),
("Maria", "Anne", "Jones", "39192", "F", None),
(None, "Julia", "", "", "F", 3000)]
columns = ["firstname", "middlename", "lastname", "dob", "gender", "salary"]
# Create a DataFrame
df = spark.createDataFrame(data, columns)
# Detect None or Empty values
df.withColumn("isEmpty", when(col("middlename") == "", True).otherwise(False)).show()
This snippet will output something like:
+---------+----------+---------+-----+------+------+-------+
|firstname|middlename|lastname | dob |gender|salary|isEmpty|
+---------+----------+---------+-----+------+------+-------+
| James | |Smith |... |M |3000 |true |
| Michael |Rose | |... |M |4000 |false |
| Robert | |Williams |... | |4000 |true |
| Maria |Anne |Jones |... |F |null |false |
| null |Julia | |... |F |3000 |false |
+---------+----------+---------+-----+------+------+-------+
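The snippet above only flags empty strings in a single column. To take stock of every kind of empty value at once, combine `isNull()` with an equality check against `""` for string columns, and with `isnan()` for numeric ones. A minimal sketch reusing the `df` defined above (note that `isnan()` operates on floating-point data, so the integer `salary` column is cast first):
from pyspark.sql.functions import col, count, isnan, when
# Count null or empty-string values in each string column
string_cols = ["firstname", "middlename", "lastname", "dob", "gender"]
df.select([count(when(col(c).isNull() | (col(c) == ""), c)).alias(c)
           for c in string_cols]).show()
# Count null or NaN values in the numeric column
df.select(count(when(col("salary").isNull() |
                     isnan(col("salary").cast("double")), "salary"))
          .alias("salary")).show()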
Replacing Empty Values
Now that we can identify empty values, we can work on replacing them. PySpark provides `fillna()` and `na.fill()` (two names for the same operation) to replace null/None or NaN values with a specified value.
Replacing Null Values
To replace null values, we can use the `fillna()` function. It accepts a value and an optional subset of columns; if a subset is not specified, it replaces nulls across all columns of a compatible type with the provided value.
# Replace 'None' values in 'salary' column with a specific value
df.na.fill({"salary": 0}).show()
# Replace 'None' values across all columns with the same value
df.fillna("unknown").show()
The first example will output something like this (note that the earlier `withColumn()` call did not modify `df`, so the `isEmpty` column does not appear):
+---------+----------+---------+-----+------+------+
|firstname|middlename|lastname | dob |gender|salary|
+---------+----------+---------+-----+------+------+
| James   |          |Smith    |...  |M     |3000  |
| Michael |Rose      |         |...  |M     |4000  |
| Robert  |          |Williams |...  |      |4000  |
| Maria   |Anne      |Jones    |...  |F     |0     |
| null    |Julia     |         |...  |F     |3000  |
+---------+----------+---------+-----+------+------+
And the second example will replace `None` with “unknown” in the string columns only; empty strings are left as they are, because `fillna()` matches nulls rather than empty strings, and the numeric `salary` column is skipped since the fill value is a string:
+---------+----------+---------+-----+------+------+
|firstname|middlename|lastname | dob |gender|salary|
+---------+----------+---------+-----+------+------+
| James   |          |Smith    |36636|M     |3000  |
| Michael |Rose      |         |40288|M     |4000  |
| Robert  |          |Williams |42114|      |4000  |
| Maria   |Anne      |Jones    |39192|F     |null  |
| unknown |Julia     |         |     |F     |3000  |
+---------+----------+---------+-----+------+------+
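`fillna()` also accepts a `subset` parameter to restrict the fill to specific columns, and for floating-point columns the same call covers `NaN` values too. A short sketch against the same `df`:
# Replace nulls only in the 'salary' column
df.fillna(0, subset=["salary"]).show()
# A dictionary fills several columns of different types in one pass
df.fillna({"salary": 0, "firstname": "unknown"}).show()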
Replacing Empty Strings
Replacing empty strings is slightly different, since `fillna()` and `na.fill()` only target nulls and NaNs and will not touch an empty string. Instead, we can use the `when()` and `otherwise()` functions (or `replace()`, shown further below) to achieve the desired replacements.
df.withColumn("middlename", when(col("middlename") == "", "No Middle Name").otherwise(col("middlename"))).show()
This will replace all empty middle names with the string “No Middle Name”:
+---------+--------------+---------+-----+------+-------+
|firstname|middlename |lastname | dob |gender|salary |
+---------+--------------+---------+-----+------+-------+
| James |No Middle Name|Smith |... |M |3000 |
| Michael |Rose | |... |M |4000 |
| Robert |No Middle Name|Williams |... | |4000 |
| Maria |Anne |Jones |... |F |null |
| null |Julia | |... |F |3000 |
+---------+--------------+---------+-----+------+-------+
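An alternative, sketched here on the same `df`, is `DataFrame.replace()` (also reachable as `df.na.replace()`), which swaps exact values, including empty strings:
# Replace empty strings in 'middlename' with a placeholder
df.replace("", "No Middle Name", subset=["middlename"]).show()
# Or convert empty strings to null first, then let fillna() handle them
df.replace("", None, subset=["middlename"]) \
  .fillna("No Middle Name", subset=["middlename"]).show()
Converting empty strings to null first is handy when you want a single `fillna()` call to cover both nulls and empty strings.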
Conclusion
Dealing with empty values is a common and essential part of pre-processing data in PySpark. As we have seen, PySpark provides versatile functions to detect and replace null values, NaN values, and empty strings in DataFrames. Remember that different situations call for different handling of these empty values, so always be sure to understand the context of your data and the implications of replacing these values before performing any operations. With the techniques outlined above, you should be well-equipped to handle most scenarios involving empty values in PySpark DataFrames.
Lastly, don’t forget to stop the Spark session after completing your tasks to release the resources:
# Stop the Spark session
spark.stop()
We hope this comprehensive guide has been helpful in your PySpark data processing journey. Happy coding!