How to Count Distinct Values of Every Column in a Spark DataFrame?

Counting the distinct values of every column in a Spark DataFrame is a common requirement in data analysis. It can be done in PySpark, Scala, or Java; the examples below use PySpark.

Using PySpark

First, let’s assume we have a sample DataFrame:


from pyspark.sql import SparkSession

# Initialize a SparkSession
spark = SparkSession.builder.appName("DistinctValues").getOrCreate()

# Sample data
data = [
    (1, "A", 100),
    (2, "B", 200),
    (3, "A", 300),
    (4, "C", 100),
    (5, "B", 200),
]

# Sample schema
columns = ["ID", "Category", "Value"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Show the DataFrame
df.show()

The output will be:


+---+--------+-----+
| ID|Category|Value|
+---+--------+-----+
|  1|       A|  100|
|  2|       B|  200|
|  3|       A|  300|
|  4|       C|  100|
|  5|       B|  200|
+---+--------+-----+

Counting Distinct Values

To count the distinct values in each column, select the column, call `distinct()` to drop duplicate rows, and then `count()` the result. Iterating over `df.columns` applies this to every column.


from pyspark.sql.functions import col

# Initialize an empty dictionary to store the results
distinct_counts = {}

# Iterate through each column in the DataFrame
for column in df.columns:
    count = df.select(col(column)).distinct().count()
    distinct_counts[column] = count

# Print the distinct counts for each column
for column, count in distinct_counts.items():
    print(f"{column}: {count}")

The output will be:


ID: 5
Category: 3
Value: 3

Explanation

Let’s break down the steps:

1. **Initialization**:
– We initialize a Spark session and create a sample DataFrame.

2. **Data Inspection**:
– We print the DataFrame to verify its contents.

3. **Counting Distinct Values**:
– We initialize an empty dictionary `distinct_counts` to store the distinct counts.
– We iterate through each column of the DataFrame with a `for` loop.
– For each column, we use `select(col(column))` to select it and `distinct().count()` to get the number of distinct values.
– We store the result in the dictionary with the column name as the key.

4. **Output**:
– Finally, we print the distinct counts for each column.

In case you want to achieve the same task using Scala or Java, the logic remains similar but involves syntactic differences. Would you like an example using Scala or Java as well?
