How Do You Fill Null Values in Specific Columns of a PySpark DataFrame?

Filling missing values (nulls) in specific columns of a PySpark DataFrame is a common data preprocessing task. The DataFrame `fillna` method handles this directly. Let's go through how to use it in detail.

Using fillna to Fill Missing Values in Specific Columns

You can use the `fillna` method and specify a dictionary where the keys are the column names and the values are the replacement values for the respective columns. Here is an example:

Example Code

Let’s consider a sample PySpark DataFrame:


from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder.appName("FillNA Example").getOrCreate()

# Sample Data
data = [
    (1, "John", None),
    (2, None, "Sales"),
    (3, "Mike", None),
    (None, "Sally", "HR")
]

# Creating a DataFrame
df = spark.createDataFrame(data, ["ID", "Name", "Department"])

# Show the DataFrame
df.show()

+----+-----+----------+
|  ID| Name|Department|
+----+-----+----------+
|   1| John|      null|
|   2| null|     Sales|
|   3| Mike|      null|
|null|Sally|        HR|
+----+-----+----------+

To replace nulls in the "Name" column with "NA" and nulls in the "Department" column with "Unknown", you can use:


# Replace nulls with 'NA' in the Name column and 'Unknown' in the Department column
replacement_values = {"Name": "NA", "Department": "Unknown"}

df_filled = df.fillna(replacement_values)

# Show the DataFrame after fillna
df_filled.show()

+----+-----+----------+
|  ID| Name|Department|
+----+-----+----------+
|   1| John|   Unknown|
|   2|   NA|     Sales|
|   3| Mike|   Unknown|
|null|Sally|        HR|
+----+-----+----------+

Explanation

1. **Library Imports and Spark Session Initialization**: We import necessary modules and set up a Spark session.
2. **Creating DataFrame**: A sample DataFrame is created with some null values.
3. **Displaying Original DataFrame**: We display the original DataFrame to show the null values.
4. **Filling Missing Values**:
   - We create a dictionary with the column names as keys and the replacement values as values.
   - We use the `fillna` method and pass the dictionary to it.
5. **Displaying Modified DataFrame**: Finally, we display the DataFrame after the missing values are filled.

This approach fills only the columns named in the dictionary and leaves every other column unaffected. In the example, the null in the `ID` column remains because `ID` is not a key in the dictionary.

