To efficiently find the count of null and NaN values for each column in a PySpark DataFrame, you can use a combination of built-in functions from the `pyspark.sql.functions` module like `isnull()`, `isnan()`, `sum()`, and `col()`. Here’s a detailed explanation and a step-by-step guide on how to achieve this:
Step-by-Step Guide
Let’s say you have a Spark DataFrame named `df`.
1. Sample Data
Firstly, let’s create a sample DataFrame with some null and NaN values:
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.master("local").appName("NullNanCount").getOrCreate()
# Create a DataFrame with null and NaN values
data = [
    (1.0, float('nan'), 3.0, None),
    (2.0, 2.0, None, 'a'),
    (None, float('nan'), None, 'b'),
    (4.0, None, 6.0, None),
]
columns = ["col1", "col2", "col3", "col4"]
df = spark.createDataFrame(data, columns)
df.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| 1.0| NaN| 3.0|null|
| 2.0| 2.0|null| a|
|null| NaN|null| b|
| 4.0|null| 6.0|null|
+----+----+----+----+
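Since NaN is a real floating-point value while null marks a missing value, it helps to confirm the inferred schema before counting. With the sample data above, the schema should look like this:
df.printSchema()
# root
#  |-- col1: double (nullable = true)
#  |-- col2: double (nullable = true)
#  |-- col3: double (nullable = true)
#  |-- col4: string (nullable = true)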
2. Counting Null and NaN Values for Each Column
To count the null and NaN values for each column, you can use the following code:
from pyspark.sql.functions import col, isnull, isnan
# Create a dictionary to store the count of null and NaN values for each column
null_nan_counts = {}
dtypes = dict(df.dtypes)  # column name -> data type string, e.g. 'double', 'string'
for column in df.columns:
    null_count = df.filter(isnull(col(column))).count()
    # isnan() is only defined for float/double columns; other types can only contain nulls
    nan_count = df.filter(isnan(col(column))).count() if dtypes[column] in ('double', 'float') else 0
    total_count = null_count + nan_count
    null_nan_counts[column] = total_count
# Show the result
for column, count in null_nan_counts.items():
    print(f"Column '{column}' has {count} null/NaN values.")
Column 'col1' has 1 null/NaN values.
Column 'col2' has 3 null/NaN values.
Column 'col3' has 2 null/NaN values.
Column 'col4' has 2 null/NaN values.
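Note that every `filter(...).count()` call in this loop launches its own Spark job, so if you use the loop-based approach on a larger dataset, it may be worth caching the DataFrame before the loop. This is an optional tweak, not required for the toy example:
# Optional: keep df in memory so the repeated per-column counts don't re-read the source data
df.cache()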
Explanation
1. **Initialize Spark Session:** We start by initializing a Spark session.
2. **Create DataFrame:** We create a sample DataFrame with columns `col1`, `col2`, `col3`, and `col4` that contain some null and NaN values.
3. **Filter and Count Null/NaN Values:**
– We iterate over each column of the DataFrame.
– For null values, we use `isnull()`.
– For NaN values, we use `isnan()` on float/double columns. `isnan()` is only defined for floating-point columns, hence the additional `dtypes` check (see the quick check after this list).
– Sum the counts of both null and NaN values for each column.
4. **Store Results:** The counts of null and NaN values for each column are stored in a dictionary and printed out.
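To make the null/NaN distinction concrete, here is a quick check on `col2` of the sample data, which contains two NaN entries and one null:
# NaN is a real double value, not a null, so the two functions match different rows
df.filter(isnull(col("col2"))).count()  # 1 -> the genuinely missing value
df.filter(isnan(col("col2"))).count()   # 2 -> the two NaN entries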
Alternative Approach
An alternative approach is to use `agg()` together with `isnull()`, `isnan()`, `when()`, and `sum()`, which computes all of the counts in a single aggregation:
from pyspark.sql.functions import col, isnan, isnull, sum, when
# Build one aggregation expression per column, applying isnan() only to float/double columns
count_exprs = []
for column, dtype in df.dtypes:
    condition = isnull(col(column))
    if dtype in ('double', 'float'):
        condition = condition | isnan(col(column))
    count_exprs.append(sum(when(condition, 1).otherwise(0)).alias(column))
null_nan_counts_df = df.agg(*count_exprs)
null_nan_counts_df.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| 1| 3| 2| 2|
+----+----+----+----+
This method performs all of the counts in a single pass over the data, instead of launching one or two Spark jobs per column as the loop-based approach does.
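If you need the result as a plain Python dictionary (for logging or downstream checks, for example), you can collect the single aggregated row; a minimal follow-up using the `null_nan_counts_df` produced above:
# Collect the one-row aggregation into a regular Python dict
counts = null_nan_counts_df.first().asDict()
print(counts)  # {'col1': 1, 'col2': 3, 'col3': 2, 'col4': 2}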