Filling missing values (nulls) in specific columns of a PySpark DataFrame is a common data preprocessing task. You can do this with the DataFrame `fillna` method (also available as `df.na.fill`). Let’s go through how to do this in detail.
Using fillna to Fill Missing Values in Specific Columns
You can use the `fillna` method and specify a dictionary where the keys are the column names and the values are the replacement values for the respective columns. Here is an example:
Example Code
Let’s consider a sample PySpark DataFrame:
from pyspark.sql import SparkSession
# Initialize Spark Session
spark = SparkSession.builder.appName("FillNA Example").getOrCreate()
# Sample Data
data = [
    (1, "John", None),
    (2, None, "Sales"),
    (3, "Mike", None),
    (None, "Sally", "HR")
]
# Creating a DataFrame
df = spark.createDataFrame(data, ["ID", "Name", "Department"])
# Show the DataFrame
df.show()
+----+-----+----------+
|  ID| Name|Department|
+----+-----+----------+
|   1| John|      null|
|   2| null|     Sales|
|   3| Mike|      null|
|null|Sally|        HR|
+----+-----+----------+
To replace nulls in the “Name” column with “NA” and nulls in the “Department” column with “Unknown”, you can use:
# Filling 'NA' in the Name column and 'Unknown' in the Department column
replacement_values = {"Name": "NA", "Department": "Unknown"}
df_filled = df.fillna(replacement_values)
# Show the DataFrame after fillna
df_filled.show()
+----+-----+----------+
|  ID| Name|Department|
+----+-----+----------+
|   1| John|   Unknown|
|   2|   NA|     Sales|
|   3| Mike|   Unknown|
|null|Sally|        HR|
+----+-----+----------+
Explanation
1. **Library Imports and Spark Session Initialization**: We import necessary modules and set up a Spark session.
2. **Creating DataFrame**: A sample DataFrame is created with some null values.
3. **Displaying Original DataFrame**: We display the original DataFrame to show the null values.
4. **Filling Missing Values**:
   - We create a dictionary with the column names as keys and the replacement values as values.
   - We pass the dictionary to the `fillna` method, which returns a new DataFrame.
5. **Displaying Modified DataFrame**: Finally, we display the DataFrame after the missing values are filled.
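Because the dictionary names only the “Name” and “Department” columns, only those columns are filled. The null in the “ID” column is left as it is, as the last row of the output shows. Note that `fillna` returns a new DataFrame and does not modify `df`.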