Removing duplicate rows based on specific columns in an RDD or Spark DataFrame is a common task in data processing. Below, let's explore how to accomplish this using both PySpark and Scala, with a simple DataFrame for illustration.
Removing Duplicates Using PySpark
First, let’s create a sample DataFrame using PySpark:
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName("RemoveDuplicates").getOrCreate()
# Sample Data
data = [("Alice", 34, "HR"),
("Bob", 45, "Finance"),
("Alice", 34, "HR"),
("David", 29, "HR")]
# Create DataFrame
df = spark.createDataFrame(data, ["Name", "Age", "Department"])
# Show original DataFrame
print("Original DataFrame:")
df.show()
Original DataFrame:
+-----+---+----------+
| Name|Age|Department|
+-----+---+----------+
|Alice| 34| HR|
| Bob| 45| Finance|
|Alice| 34| HR|
|David| 29| HR|
+-----+---+----------+
To remove duplicates based on specific columns (`Name` and `Age` in this case):
# Remove duplicates based on the 'Name' and 'Age' columns
df_dedup = df.dropDuplicates(["Name", "Age"])
# Show the DataFrame after removing duplicates
print("DataFrame After Removing Duplicates:")
df_dedup.show()
DataFrame After Removing Duplicates:
+-----+---+----------+
| Name|Age|Department|
+-----+---+----------+
| Bob| 45| Finance|
|Alice| 34| HR|
|David| 29| HR|
+-----+---+----------+
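Note that when several rows share the same `Name` and `Age`, `dropDuplicates` keeps an arbitrary one of them, so which row survives (and, with it, which `Department` value) is not guaranteed across runs. If you need a deterministic result, one common pattern is to rank rows within each group with a window function and keep the first. Below is a minimal sketch, where ordering by `Department` is an arbitrary choice made purely for illustration:

from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

# Rank rows within each (Name, Age) group; the ordering column decides
# which duplicate counts as "first" (Department is just an example here)
w = Window.partitionBy("Name", "Age").orderBy("Department")

# Keep only the first-ranked row per group, then drop the helper column
df_dedup_first = (df.withColumn("rn", row_number().over(w))
                    .filter("rn = 1")
                    .drop("rn"))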
Removing Duplicates Using Scala
Below is the equivalent Scala code for removing duplicates based on specific columns:
import org.apache.spark.sql.SparkSession
// Initialize Spark session
val spark = SparkSession.builder.appName("RemoveDuplicates").getOrCreate()
// Sample Data
val data = Seq(
("Alice", 34, "HR"),
("Bob", 45, "Finance"),
("Alice", 34, "HR"),
("David", 29, "HR")
)
// Import implicits so toDF is available on local collections
import spark.implicits._
// Create DataFrame
val df = data.toDF("Name", "Age", "Department")
// Show original DataFrame
println("Original DataFrame:")
df.show()
Original DataFrame:
+-----+---+----------+
| Name|Age|Department|
+-----+---+----------+
|Alice| 34| HR|
| Bob| 45| Finance|
|Alice| 34| HR|
|David| 29| HR|
+-----+---+----------+
To remove duplicates based on specific columns (`Name` and `Age` in this case):
// Remove duplicates based on the 'Name' and 'Age' columns
val df_dedup = df.dropDuplicates("Name", "Age")
// Show the DataFrame after removing duplicates
println("DataFrame After Removing Duplicates:")
df_dedup.show()
DataFrame After Removing Duplicates:
+-----+---+----------+
| Name|Age|Department|
+-----+---+----------+
| Bob| 45| Finance|
|Alice| 34| HR|
|David| 29| HR|
+-----+---+----------+
In both PySpark and Scala, the `dropDuplicates` method removes duplicate rows based on the specified columns, keeping one row for each unique combination of those columns. Called with no arguments, `dropDuplicates` (like `distinct()`) compares entire rows instead.
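The introduction also mentioned RDDs, which have no `dropDuplicates` method. A minimal PySpark sketch of the same idea on an RDD, reusing the sample DataFrame from above: key each row by the columns of interest, then keep one row per key with `reduceByKey` (as with `dropDuplicates`, which row wins per key is not guaranteed):

# Deduplicate an RDD on (Name, Age): key rows by those fields,
# then collapse each group down to a single row
rdd = df.rdd
deduped_rdd = (rdd.map(lambda row: ((row["Name"], row["Age"]), row))
                  .reduceByKey(lambda a, b: a)  # arbitrarily keep one row per key
                  .values())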