How to Fetch Distinct Values from a Column in Spark DataFrame?

Fetching distinct values from a column in a Spark DataFrame is a common operation. It surfaces the unique entries in the data, which is essential for tasks such as data profiling, deduplication, and analyzing categorical columns. Below, we’ll explore how to achieve this using PySpark and Scala.

Fetching Distinct Values using PySpark

Using PySpark, you can retrieve distinct values from a column in a DataFrame by leveraging the `distinct()` method and selecting the specific column.


from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder.appName("DistinctValuesExample").getOrCreate()

# Sample Data
data = [("Alice", 34), ("Bob", 45), ("Alice", 34), ("David", 29)]
columns = ["Name", "Age"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Fetch Distinct Values from 'Name' Column
distinct_names = df.select("Name").distinct()

# Show the results
distinct_names.show()

+-----+
| Name|
+-----+
|David|
|Alice|
|  Bob|
+-----+

Fetching Distinct Values using Scala

The same operation can be performed in Scala using the Spark DataFrame API. We’ll use the `distinct` method after selecting the specific column.


import org.apache.spark.sql.SparkSession

// Initialize Spark Session
val spark = SparkSession.builder.appName("DistinctValuesExample").getOrCreate()

// Sample Data
val data = Seq(("Alice", 34), ("Bob", 45), ("Alice", 34), ("David", 29))
val columns = Seq("Name", "Age")

// Create DataFrame
import spark.implicits._
val df = data.toDF(columns: _*)

// Fetch Distinct Values from 'Name' Column
val distinctNames = df.select("Name").distinct()

// Show the results
distinctNames.show()

+-----+
| Name|
+-----+
|David|
|Alice|
|  Bob|
+-----+

Explanation

Here’s a detailed explanation of each part of the code:

  • Initializing Spark Session: This step starts a new Spark session, which is required to create and manipulate DataFrames.
  • Sample Data: We create a sample dataset to work with. The data is structured as a list of tuples in Python and a `Seq` of tuples in Scala.
  • Create DataFrame: We then create a DataFrame from the sample data. In PySpark, we use `spark.createDataFrame`, while in Scala, we convert a sequence to a DataFrame using the `toDF` method.
  • Fetching Distinct Values: The `select` method is used to focus on the specific column, and `distinct` ensures that only unique values are retained.
  • Displaying Results: Finally, we use the `show` method to display the distinct values.

This approach is fundamental and highly useful, especially when performing data cleaning, deduplication, and exploratory data analysis.

