How to Subtract Two DataFrames in Apache Spark?

In Apache Spark, subtracting two DataFrames can be achieved using the `subtract` method. The `subtract` method removes from the first DataFrame any rows that also appear in the second DataFrame, returning the difference between the two. It is equivalent to the SQL `EXCEPT` clause. Let’s dive into a detailed explanation and examples using PySpark.

PySpark: Subtract Two DataFrames

In PySpark, you can subtract two DataFrames using the `subtract` method provided by the DataFrame API. Here’s an example:

Example

Let’s create two DataFrames and subtract them:


from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("Subtract DataFrames").getOrCreate()

# Create first DataFrame
data1 = [("Alice", 34), ("Bob", 45), ("Charlie", 29)]
columns = ["Name", "Age"]
df1 = spark.createDataFrame(data1, columns)

# Create second DataFrame
data2 = [("Alice", 34), ("David", 30)]
df2 = spark.createDataFrame(data2, columns)

# Subtract df2 from df1
df_diff = df1.subtract(df2)

# Show the result
df_diff.show()

Code Snippet Output


+-------+---+
|   Name|Age|
+-------+---+
|Charlie| 29|
|    Bob| 45|
+-------+---+

In this example, the row present in both `df1` and `df2` (`("Alice", 34)`) is removed from `df1`. The resulting DataFrame `df_diff` contains only the rows that appear in `df1` but not in `df2`.

Scala: Subtract Two DataFrames

Similarly, you can achieve the same in Scala. Here’s an example:

Example

Scala code for the same operation:


import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("Subtract DataFrames").getOrCreate()

// Create first DataFrame
val data1 = Seq(("Alice", 34), ("Bob", 45), ("Charlie", 29))
val df1 = spark.createDataFrame(data1).toDF("Name", "Age")

// Create second DataFrame
val data2 = Seq(("Alice", 34), ("David", 30))
val df2 = spark.createDataFrame(data2).toDF("Name", "Age")

// Subtract df2 from df1
val df_diff = df1.except(df2)

// Show the result
df_diff.show()

Code Snippet Output


+-------+---+
|   Name|Age|
+-------+---+
|Charlie| 29|
|    Bob| 45|
+-------+---+

In this Scala example, the `except` method is used, which behaves the same way as the `subtract` method in PySpark.

Considerations

There are a few things to consider when using the `subtract` method:

  • Schema Compatibility: The two DataFrames must have the same number of columns with compatible types. Columns are matched by position, not by name, so mismatched schemas will cause the `subtract` call to fail with an analysis error.
  • Duplicates: The `subtract` method behaves like SQL’s `EXCEPT DISTINCT`: duplicate rows in the first DataFrame are collapsed to a single row in the result. If you need to preserve duplicates, use `exceptAll` instead.

Now you know how to subtract two DataFrames in Apache Spark using PySpark and Scala. This method is useful for finding differences between two datasets.
