In Apache Spark, you can subtract two DataFrames with the `subtract` method (PySpark) or the `except` method (Scala). The operation removes from the first DataFrame every row that also appears in the second, behaving like the SQL `EXCEPT` clause: it returns the distinct rows of the first DataFrame that are not present in the second. Let’s dive into a detailed explanation with examples in both PySpark and Scala.
PySpark: Subtract Two DataFrames
In PySpark, you can subtract two DataFrames using the `subtract` method provided by the DataFrame API. Here’s an example:
Example
Let’s create two DataFrames and subtract them:
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName("Subtract DataFrames").getOrCreate()
# Create first DataFrame
data1 = [("Alice", 34), ("Bob", 45), ("Charlie", 29)]
columns = ["Name", "Age"]
df1 = spark.createDataFrame(data1, columns)
# Create second DataFrame
data2 = [("Alice", 34), ("David", 30)]
df2 = spark.createDataFrame(data2, columns)
# Subtract df2 from df1
df_diff = df1.subtract(df2)
# Show the result
df_diff.show()
Code Snippet Output
+-------+---+
| Name|Age|
+-------+---+
|Charlie| 29|
| Bob| 45|
+-------+---+
In this example, any row that appears in both `df1` and `df2` is removed, so the resulting DataFrame `df_diff` contains only the rows that are in `df1` but not in `df2`.
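Because `subtract` behaves like SQL `EXCEPT`, you can get the same result through Spark SQL by registering the DataFrames as temporary views. Here is a minimal sketch reusing `df1` and `df2` from above; the view names `people1` and `people2` are arbitrary choices for this illustration:
# Register the DataFrames as temporary views so they can be queried with SQL
df1.createOrReplaceTempView("people1")
df2.createOrReplaceTempView("people2")
# EXCEPT returns the distinct rows of people1 that are absent from people2
spark.sql("SELECT * FROM people1 EXCEPT SELECT * FROM people2").show()
This produces the same rows as `df1.subtract(df2)`, though the row order may differ.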
Scala: Subtract Two DataFrames
Similarly, you can achieve the same in Scala. Here’s an example:
Example
Scala code for the same operation:
import org.apache.spark.sql.{SparkSession, DataFrame}
import org.apache.spark.sql.functions._
val spark = SparkSession.builder.appName("Subtract DataFrames").getOrCreate()
// Create first DataFrame
val data1 = Seq(("Alice", 34), ("Bob", 45), ("Charlie", 29))
val df1 = spark.createDataFrame(data1).toDF("Name", "Age")
// Create second DataFrame
val data2 = Seq(("Alice", 34), ("David", 30))
val df2 = spark.createDataFrame(data2).toDF("Name", "Age")
// Subtract df2 from df1
val df_diff = df1.except(df2)
// Show the result
df_diff.show()
Code Snippet Output
+-------+---+
| Name|Age|
+-------+---+
|Charlie| 29|
| Bob| 45|
+-------+---+
In this Scala example, the `except` method is used, which behaves the same way as the `subtract` method in PySpark; the Scala Dataset API does not define a `subtract` method, so `except` is the one to use.
Considerations
There are a few things to consider when using the `subtract` method:
- Schema Compatibility: The two DataFrames must have the same number of columns with compatible types. If the schemas don’t line up, Spark raises an `AnalysisException` instead of performing the subtraction.
- Duplicates: `subtract` behaves like `EXCEPT DISTINCT` in SQL: duplicate rows in the first DataFrame appear at most once in the result, and any row that also exists in the second DataFrame is removed entirely, no matter how many times it occurs. If you need to keep duplicate counts, use `exceptAll` instead, as shown in the sketch below.
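To make the duplicate behavior concrete, here is a small sketch contrasting `subtract` with `exceptAll` (the sample data is invented for illustration):
# df_a holds two copies of ("Alice", 34); df_b holds one copy
df_a = spark.createDataFrame([("Alice", 34), ("Alice", 34), ("Bob", 45)], ["Name", "Age"])
df_b = spark.createDataFrame([("Alice", 34)], ["Name", "Age"])
# subtract: EXCEPT DISTINCT semantics -- every ("Alice", 34) is removed
df_a.subtract(df_b).show()   # leaves only ("Bob", 45)
# exceptAll: EXCEPT ALL semantics -- removes one occurrence per match
df_a.exceptAll(df_b).show()  # leaves one ("Alice", 34) and ("Bob", 45)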
Now you know how to subtract two DataFrames in Apache Spark using PySpark and Scala. This operation is handy whenever you need to find the rows that exist in one dataset but not in another.