How to Show Distinct Column Values in PySpark DataFrame?

To show distinct column values in a PySpark DataFrame, you can use the `distinct()` or `dropDuplicates()` functions. Both remove duplicate rows, which lets you see the unique values in a selected column. Below is a detailed explanation with examples in PySpark.

Using the `distinct()` function

The `distinct()` function returns the distinct (unique) rows of a DataFrame. Applied after selecting a single column, it yields that column's unique values.

Example using `distinct()`


```python
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Sample data
data = [("Alice", 23), ("Bob", 34), ("Alice", 23), ("Eve", 29)]

# Create DataFrame
df = spark.createDataFrame(data, ["Name", "Age"])

# Show distinct column values
distinct_values_df = df.select("Name").distinct()
distinct_values_df.show()
```

The above code produces output like the following (row order may vary):


```
+-----+
| Name|
+-----+
|  Bob|
|Alice|
|  Eve|
+-----+
```

Using the `dropDuplicates()` function

The `dropDuplicates()` function drops duplicate rows based on the specified columns; called with no arguments, it considers all columns and behaves like `distinct()`.

Example using `dropDuplicates()`


```python
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Sample data
data = [("Alice", "Math"), ("Bob", "Science"), ("Alice", "Math"), ("Eve", "Science")]

# Create DataFrame
df = spark.createDataFrame(data, ["Name", "Subject"])

# Show distinct column values
distinct_values_df = df.dropDuplicates(["Name"])
distinct_values_df.show()
```

The above code produces output like the following (row order may vary):


```
+-----+-------+
| Name|Subject|
+-----+-------+
|  Bob|Science|
|Alice|   Math|
|  Eve|Science|
+-----+-------+
```

In this example, `dropDuplicates(["Name"])` removes duplicate rows based on the "Name" column, showing distinct values of the "Name" column while keeping the other columns of one representative row.

Conclusion

Both `distinct()` and `dropDuplicates()` are useful for finding distinct values in a PySpark DataFrame. Use `distinct()` when you want unique rows (or the unique values of a selected column), and `dropDuplicates()` when you want to deduplicate on a subset of columns while keeping the rest of each row.

