To show the distinct values of a column in a PySpark DataFrame, you can use the `distinct()` or `dropDuplicates()` methods. `distinct()` removes duplicate rows, so calling it after selecting a single column yields that column's unique values. `dropDuplicates()` removes rows that are duplicates with respect to the columns you name. Each is explained below with an example.
Using `distinct()` function
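The `distinct()` method returns a new DataFrame containing only the distinct (unique) rows, comparing all columns. To get the distinct values of a single column, select that column first and then call `distinct()`.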
Example using `distinct()`
from pyspark.sql import SparkSession
# Create Spark session
spark = SparkSession.builder.appName("example").getOrCreate()
# Sample data
data = [("Alice", 23), ("Bob", 34), ("Alice", 23), ("Eve", 29)]
# Create DataFrame
df = spark.createDataFrame(data, ["Name", "Age"])
# Show distinct column values
distinct_values_df = df.select("Name").distinct()
distinct_values_df.show()
The above code will output the following (row order may vary, because `distinct()` does not guarantee any ordering):
+-----+
| Name|
+-----+
|  Bob|
|Alice|
|  Eve|
+-----+
Using `dropDuplicates()` function
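The `dropDuplicates()` method removes rows that have the same values in the specified columns, keeping one row for each combination of those values. All columns of the DataFrame are retained in the result.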
Example using `dropDuplicates()`
from pyspark.sql import SparkSession
# Create Spark session
spark = SparkSession.builder.appName("example").getOrCreate()
# Sample data
data = [("Alice", "Math"), ("Bob", "Science"), ("Alice", "Math"), ("Eve", "Science")]
# Create DataFrame
df = spark.createDataFrame(data, ["Name", "Subject"])
# Keep one row per distinct Name (all columns are retained)
distinct_values_df = df.dropDuplicates(["Name"])
distinct_values_df.show()
The above code will output:
+-----+-------+
| Name|Subject|
+-----+-------+
|  Bob|Science|
|Alice|   Math|
|  Eve|Science|
+-----+-------+
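In this example, `dropDuplicates(["Name"])` removes rows whose "Name" value duplicates that of another row, leaving one row per distinct name. Unlike `select("Name").distinct()`, the result keeps every column, not just "Name". When rows with the same name differ in their other columns, Spark does not guarantee which of them is kept. Here the two "Alice" rows are identical, so the choice makes no difference.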
Conclusion
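Both `distinct()` and `dropDuplicates()` are useful for finding distinct values in a PySpark DataFrame. Use `select(...).distinct()` when you only need the unique values of one or more columns. Use `dropDuplicates([...])` when you want one row per distinct value and also need to keep the remaining columns.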