How Do You Fill Null Values in Specific Columns of a PySpark DataFrame with fillna?

Filling missing values (nulls) in specific columns of a PySpark DataFrame is a common data preprocessing task. You can do this with the DataFrame's `fillna` method, which is also available as `df.na.fill`. Let's go through how it works in detail.

Using fillna to Fill Missing Values in Specific Columns

You can use the `fillna` method and specify a dictionary where the keys are the column names and the values are the replacement values for the respective columns. Here is an example:

Example Code

Let’s consider a sample PySpark DataFrame:


from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder.appName("FillNA Example").getOrCreate()

# Sample Data
data = [
    (1, "John", None),
    (2, None, "Sales"),
    (3, "Mike", None),
    (None, "Sally", "HR")
]

# Creating a DataFrame
df = spark.createDataFrame(data, ["ID", "Name", "Department"])

# Show the DataFrame
df.show()

+----+-----+----------+
|  ID| Name|Department|
+----+-----+----------+
|   1| John|      null|
|   2| null|     Sales|
|   3| Mike|      null|
|null|Sally|        HR|
+----+-----+----------+

To replace nulls with "NA" in the "Name" column and with "Unknown" in the "Department" column, you can use:


# Filling 'NA' in the Name column and 'Unknown' in the Department column
replacement_values = {"Name": "NA", "Department": "Unknown"}

df_filled = df.fillna(replacement_values)

# Show the DataFrame after fillna
df_filled.show()

+----+-----+----------+
|  ID| Name|Department|
+----+-----+----------+
|   1| John|   Unknown|
|   2|   NA|     Sales|
|   3| Mike|   Unknown|
|null|Sally|        HR|
+----+-----+----------+

Explanation

1. **Library Imports and Spark Session Initialization**: We import necessary modules and set up a Spark session.
2. **Creating DataFrame**: A sample DataFrame is created with some null values.
3. **Displaying Original DataFrame**: We display the original DataFrame to show the null values.
4. **Filling Missing Values**:
   - We create a dictionary with the column names as keys and the replacement values as values.
   - We use the `fillna` method and pass the dictionary to it.
5. **Displaying Modified DataFrame**: Finally, we display the DataFrame after the missing values are filled.

With this approach, only the columns named in the dictionary are filled. Every other column is left untouched, which is why the null in the `ID` column is still there in the output.
