Filtering out null values from a Spark DataFrame is a common operation in data preprocessing. Here’s a step-by-step guide to achieve this using PySpark.
Step-by-Step Guide to Filter Out Null Values from a Spark DataFrame
Step 1: Initialize SparkSession
First, you need to initialize a SparkSession, which is the entry point to programming Spark with the Dataset and DataFrame API.
```python
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
    .appName("FilterNullValuesExample") \
    .getOrCreate()
```
Step 2: Create a Sample DataFrame
For demonstration purposes, let’s create a sample DataFrame with some null values.
```python
# Sample data
data = [
    (1, "Alice", 29),
    (2, "Bob", None),
    (3, None, 35),
    (4, "David", 40),
    (None, "Eve", None),
]

# Create the DataFrame
df = spark.createDataFrame(data, ["id", "name", "age"])

# Show the initial DataFrame
df.show()
```
Output:
```
+----+-----+----+
|  id| name| age|
+----+-----+----+
|   1|Alice|  29|
|   2|  Bob|null|
|   3| null|  35|
|   4|David|  40|
|null|  Eve|null|
+----+-----+----+
```
Step 3: Filter Out Null Values
You can filter out rows with null values using the `na.drop()` method (also available as `df.dropna()`), which by default removes every row that contains at least one null value.
```python
# Filter out rows with any null values
df_filtered = df.na.drop()

# Show the filtered DataFrame
df_filtered.show()
```
Output:
```
+---+-----+---+
| id| name|age|
+---+-----+---+
|  1|Alice| 29|
|  4|David| 40|
+---+-----+---+
```
Step 4: Filter Specific Columns
Sometimes, you may only want to remove rows where specific columns have null values. You can pass the column names via the `subset` parameter of `na.drop()`.
```python
# Filter out rows with null values in the 'name' column
df_filtered_specific = df.na.drop(subset=["name"])

# Show the filtered DataFrame
df_filtered_specific.show()
```
Output:
```
+----+-----+----+
|  id| name| age|
+----+-----+----+
|   1|Alice|  29|
|   2|  Bob|null|
|   4|David|  40|
|null|  Eve|null|
+----+-----+----+
```
Conclusion
Filtering out null values from a Spark DataFrame is straightforward with the `na.drop()` method. You can remove rows containing any null values, or only rows where specific columns are null. This cleaning step is essential in many data processing pipelines to ensure data integrity and quality.