How to Show Distinct Column Values in PySpark DataFrame?

To show distinct column values in a PySpark DataFrame, you can use the `distinct()` or `dropDuplicates()` functions. Both remove duplicate rows, which lets you see the unique values in a selected column. Below is a detailed explanation with examples in PySpark.

Using the `distinct()` function

The `distinct()` function returns the distinct (unique) rows of a DataFrame. Applied after selecting a single column, it yields that column's unique values.

Example using `distinct()`


```python
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Sample data
data = [("Alice", 23), ("Bob", 34), ("Alice", 23), ("Eve", 29)]

# Create DataFrame
df = spark.createDataFrame(data, ["Name", "Age"])

# Show distinct column values
distinct_values_df = df.select("Name").distinct()
distinct_values_df.show()
```

The above code produces output like the following (row order may vary):


```
+-----+
| Name|
+-----+
|  Bob|
|Alice|
|  Eve|
+-----+
```

Using the `dropDuplicates()` function

The `dropDuplicates()` function drops duplicate rows based on the specified columns; called with no arguments, it considers all columns and behaves like `distinct()`.

Example using `dropDuplicates()`


```python
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Sample data
data = [("Alice", "Math"), ("Bob", "Science"), ("Alice", "Math"), ("Eve", "Science")]

# Create DataFrame
df = spark.createDataFrame(data, ["Name", "Subject"])

# Show distinct column values
distinct_values_df = df.dropDuplicates(["Name"])
distinct_values_df.show()
```

The above code produces output like the following (row order may vary):


```
+-----+-------+
| Name|Subject|
+-----+-------+
|  Bob|Science|
|Alice|   Math|
|  Eve|Science|
+-----+-------+
```

In this example, `dropDuplicates(["Name"])` removes duplicate rows based on the "Name" column, showing distinct values of the "Name" column while keeping the other columns of one representative row.

Conclusion

Both `distinct()` and `dropDuplicates()` are useful for finding distinct values in a PySpark DataFrame. Use `distinct()` when you want unique rows (or the unique values of a selected column), and `dropDuplicates()` when you want to deduplicate on a subset of columns while keeping the rest of each row.

