How to Efficiently Find Count of Null and NaN Values for Each Column in a PySpark DataFrame?

To efficiently count the null and NaN values in each column of a PySpark DataFrame, you can combine built-in functions from the `pyspark.sql.functions` module such as `isnull()`, `isnan()`, `sum()`, and `col()`. Here is a step-by-step guide:

Step-by-Step Guide

Let’s say you have a Spark DataFrame named `df`.

1. Sample Data

Firstly, let’s create a sample DataFrame with some null and NaN values:


from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.master("local").appName("NullNanCount").getOrCreate()

# Create a DataFrame with null and NaN values
# (numeric values are given as floats so Spark infers DoubleType,
#  which is what isnan() requires later on)
data = [
    (1.0, float('nan'), 3.0, None),
    (2.0, 2.0, None, 'a'),
    (None, float('nan'), None, 'b'),
    (4.0, None, 6.0, None)
]
columns = ["col1", "col2", "col3", "col4"]

df = spark.createDataFrame(data, columns)
df.show()

+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| 1.0| NaN| 3.0|null|
| 2.0| 2.0|null|   a|
|null| NaN|null|   b|
| 4.0|null| 6.0|null|
+----+----+----+----+
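
Since the NaN check in the next step depends on each column's data type, it can help to confirm the inferred schema first. With the float-valued sample data above, the output should look roughly like this:


# Inspect the inferred schema; isnan() only applies to float/double columns
df.printSchema()

root
 |-- col1: double (nullable = true)
 |-- col2: double (nullable = true)
 |-- col3: double (nullable = true)
 |-- col4: string (nullable = true)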

2. Counting Null and NaN Values for Each Column

To count the null and NaN values for each column, you can use the following code:


from pyspark.sql.functions import col, isnull, isnan

# Create a dictionary to store the count of null and NaN values for each column
null_nan_counts = {}

# Map each column name to its data type, e.g. {'col1': 'double', ...}
dtypes = dict(df.dtypes)

for column in df.columns:
    # Nulls can appear in columns of any type
    null_count = df.filter(isnull(col(column))).count()
    # isnan() is only defined for float/double columns, hence the type check
    nan_count = df.filter(isnan(col(column))).count() if dtypes[column] in ('float', 'double') else 0
    null_nan_counts[column] = null_count + nan_count

# Show the result
for column, count in null_nan_counts.items():
    print(f"Column '{column}' has {count} null/NaN values.")

Column 'col1' has 1 null/NaN values.
Column 'col2' has 3 null/NaN values.
Column 'col3' has 2 null/NaN values.
Column 'col4' has 2 null/NaN values.
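
Note that this loop triggers up to two Spark jobs per column (one for `isnull()` and one for `isnan()`). For anything larger than a toy DataFrame, caching the DataFrame before the loop avoids recomputing it on every pass; a minimal sketch (whether it pays off depends on your data size and available memory):


# Cache the DataFrame so the repeated filter/count jobs reuse the same data
df.cache()
df.count()        # materialize the cache once
# ... run the per-column loop above ...
df.unpersist()    # release the cached data when done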

Explanation

1. **Initialize Spark Session:** We start by initializing a Spark session.
2. **Create DataFrame:** We create a sample DataFrame with columns `col1`, `col2`, `col3`, and `col4` that contain some null and NaN values.
3. **Filter and Count Null/NaN Values:**
- We iterate over each column of the DataFrame.
- For null values, we use `isnull()`.
- For NaN values, we use `isnan()` on double-type columns. Note that `isnan()` applies only to float/double columns, hence the additional check against `dtypes` (see the short sketch after this list).
- We sum the counts of both null and NaN values for each column.
4. **Store Results:** The counts of null and NaN values for each column are stored in a dictionary and printed out.
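
The distinction matters because `None` and `float('nan')` are different things in Spark: `isnull()` flags only the former and `isnan()` only the latter. A quick check on `col2` from the sample data makes this visible:


from pyspark.sql.functions import isnan, isnull

# isnull() and isnan() flag different rows: col2 contains both None and NaN
df.select("col2", isnull("col2").alias("is_null"), isnan("col2").alias("is_nan")).show()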

Alternative Approach

An alternative approach is to use `agg()` with a combination of `when()`, `isnull()`, `isnan()`, and `sum()`, which counts everything in a single pass and yields more concise code:


from pyspark.sql.functions import col, isnan, isnull, lit, sum, when

# Build one aggregation expression per column; isnan() is applied only to
# float/double columns, since it is not defined for other types
count_exprs = [
    sum(
        when(
            isnull(col(c)) | (isnan(col(c)) if t in ('float', 'double') else lit(False)),
            1
        ).otherwise(0)
    ).alias(c)
    for c, t in df.dtypes
]

null_nan_counts_df = df.agg(*count_exprs)

null_nan_counts_df.show()

+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   1|   3|   2|   2|
+----+----+----+----+

This method computes all the counts in a single pass over the data (one aggregation job), instead of up to two jobs per column as in the loop-based approach.
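
The result is a one-row DataFrame. If you would rather have a plain Python dictionary, like the one built in the loop-based approach, you can collect that single row:


# Collect the single aggregation row into a Python dict
null_nan_counts = null_nan_counts_df.first().asDict()
print(null_nan_counts)  # {'col1': 1, 'col2': 3, 'col3': 2, 'col4': 2}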
