How to Filter Out Null Values from a Spark DataFrame: Step-by-Step Guide

Filtering out null values from a Spark DataFrame is a common operation in data preprocessing. Here’s a step-by-step guide to doing it with PySpark.

Step 1: Initialize SparkSession

First, you need to initialize a SparkSession, which is the entry point to programming Spark with the Dataset and DataFrame API.


from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
    .appName("FilterNullValuesExample") \
    .getOrCreate()

Step 2: Create a Sample DataFrame

For demonstration purposes, let’s create a sample DataFrame with some null values.


# Sample data
data = [
    (1, "Alice", 29),
    (2, "Bob", None),
    (3, None, 35),
    (4, "David", 40),
    (None, "Eve", None)
]

# Creating DataFrame
df = spark.createDataFrame(data, ["id", "name", "age"])

# Show the initial DataFrame
df.show()

Output:


+----+-----+----+
|  id| name| age|
+----+-----+----+
|   1|Alice|  29|
|   2|  Bob|null|
|   3| null|  35|
|   4|David|  40|
|null|  Eve|null|
+----+-----+----+
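
Note that Spark infers each column’s type from the non-null values in the sample data. If a column could be entirely null, type inference fails, so in production code you may prefer to pass an explicit schema. Here is a minimal sketch of that approach:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Define the schema explicitly so nullable columns keep well-defined types
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])

df = spark.createDataFrame(data, schema)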

Step 3: Filter Out Null Values

You can drop rows that contain null values using the `na.drop()` method (also available as `dropna()`). By default, it removes every row that contains at least one null value.


# Filter out rows with any null values
df_filtered = df.na.drop()

# Show the filtered DataFrame
df_filtered.show()

Output:


+---+-----+---+
| id| name|age|
+---+-----+---+
|  1|Alice| 29|
|  4|David| 40|
+---+-----+---+
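
The `na.drop()` method also accepts `how` and `thresh` parameters: `how="all"` drops a row only when every column is null, and `thresh=n` keeps rows with at least n non-null values (overriding `how`). A quick sketch against the sample DataFrame:

# Drop rows only when every column is null (no such row here, so nothing is removed)
df.na.drop(how="all").show()

# Keep rows with at least 2 non-null values (drops only the Eve row, which has one)
df.na.drop(thresh=2).show()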

Step 4: Filter Specific Columns

Sometimes you may want to remove rows only when specific columns contain null values. You can pass those column names through the `subset` parameter of `na.drop()`.


# Filter out rows with null values in the 'name' column
df_filtered_specific = df.na.drop(subset=["name"])

# Show the filtered DataFrame
df_filtered_specific.show()

Output:


+----+-----+----+
|  id| name| age|
+----+-----+----+
|   1|Alice|  29|
|   2|  Bob|null|
|   4|David|  40|
|null|  Eve|null|
+----+-----+----+
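
An equivalent way to express the same filter is with `filter()` (or its alias `where()`) and the `isNotNull()` column method, which is convenient when the null check needs to be combined with other conditions. A minimal sketch:

from pyspark.sql.functions import col

# Same result as df.na.drop(subset=["name"])
df.filter(col("name").isNotNull()).show()

# Combine the null check with another condition
df.filter(col("name").isNotNull() & (col("age") > 30)).show()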

Conclusion

Filtering out null values from a Spark DataFrame is straightforward with the `na.drop()` method: you can remove rows containing any null values, restrict the check to specific columns with `subset`, or tune the behavior with `how` and `thresh`. For more granular conditions, `filter()` with `isNotNull()` expresses the same logic. This cleaning step is essential in many data processing pipelines to ensure data integrity and quality.
