How to Remove Duplicates from Rows Based on Specific Columns in an RDD/Spark DataFrame?

Removing duplicate rows based on the values of specific columns is a common task when processing data in an RDD or Spark DataFrame. Below, we walk through how to do this in both PySpark and Scala, using a simple DataFrame for illustration.

Removing Duplicates Using PySpark

First, let’s create a sample DataFrame using PySpark:


from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("RemoveDuplicates").getOrCreate()

# Sample Data
data = [("Alice", 34, "HR"),
        ("Bob", 45, "Finance"),
        ("Alice", 34, "HR"),
        ("David", 29, "HR")]

# Create DataFrame
df = spark.createDataFrame(data, ["Name", "Age", "Department"])

# Show original DataFrame
print("Original DataFrame:")
df.show()

Original DataFrame:
+-----+---+----------+
| Name|Age|Department|
+-----+---+----------+
|Alice| 34|        HR|
|  Bob| 45|   Finance|
|Alice| 34|        HR|
|David| 29|        HR|
+-----+---+----------+

To remove duplicates based on specific columns (`Name` and `Age` in this case):


# Remove duplicates based on the 'Name' and 'Age' columns
df_dedup = df.dropDuplicates(["Name", "Age"])

# Show the DataFrame after removing duplicates
print("DataFrame After Removing Duplicates:")
df_dedup.show()

DataFrame After Removing Duplicates:
+-----+---+----------+
| Name|Age|Department|
+-----+---+----------+
|  Bob| 45|   Finance|
|Alice| 34|        HR|
|David| 29|        HR|
+-----+---+----------+
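
Note that calling `dropDuplicates()` with no arguments compares entire rows, which is equivalent to `distinct()`:


# With no column subset, dropDuplicates() deduplicates across all columns
df.dropDuplicates().show()   # same result as df.distinct().show()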

Removing Duplicates Using Scala

Below is the equivalent Scala code for removing duplicates based on specific columns:


import org.apache.spark.sql.SparkSession

// Initialize Spark session
val spark = SparkSession.builder.appName("RemoveDuplicates").getOrCreate()

// Sample Data
val data = Seq(
    ("Alice", 34, "HR"),
    ("Bob", 45, "Finance"),
    ("Alice", 34, "HR"),
    ("David", 29, "HR")
)

// Create DataFrame
import spark.implicits._
val df = data.toDF("Name", "Age", "Department")

// Show original DataFrame
println("Original DataFrame:")
df.show()

Original DataFrame:
+-----+---+----------+
| Name|Age|Department|
+-----+---+----------+
|Alice| 34|        HR|
|  Bob| 45|   Finance|
|Alice| 34|        HR|
|David| 29|        HR|
+-----+---+----------+

To remove duplicates based on specific columns (`Name` and `Age` in this case):


// Remove duplicates based on the 'Name' and 'Age' columns
val df_dedup = df.dropDuplicates("Name", "Age")

// Show the DataFrame after removing duplicates
println("DataFrame After Removing Duplicates:")
df_dedup.show()

DataFrame After Removing Duplicates:
+-----+---+----------+
| Name|Age|Department|
+-----+---+----------+
|  Bob| 45|   Finance|
|Alice| 34|        HR|
|David| 29|        HR|
+-----+---+----------+

In both PySpark and Scala, the `dropDuplicates` method removes duplicate rows based on the specified columns, leaving one row per unique combination of those columns. Keep in mind that when rows share the subset columns but differ in the remaining columns, which of them is kept is not guaranteed.
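
The title also mentions RDDs. A plain RDD offers `distinct()` for whole-record deduplication, but nothing built in for a subset of fields; a common pattern is to key each record on the fields of interest and keep one record per key with `reduceByKey`. Here is a minimal PySpark sketch, assuming the records are tuples shaped like the sample data above:


# Convert the DataFrame rows to plain tuples
rdd = df.rdd.map(tuple)

# Key each record on (Name, Age), keep one record per key, then drop the keys
deduped = (rdd
           .keyBy(lambda rec: (rec[0], rec[1]))
           .reduceByKey(lambda a, b: a)
           .values())

print(deduped.collect())

As with `dropDuplicates`, which record survives for a given key is arbitrary.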
