What is the Difference Between == and === in Scala and Spark?

Let’s dive into the differences between `==` and `===` in both Scala and Apache Spark: they look similar, but they mean different things in each context.

In Scala

In Scala, `==` compares two values for equality. For reference types it delegates to the `equals` method after a null check, so `null == x` returns `false` instead of throwing. Because `==` accepts any two values, comparing unrelated types (say, an `Int` and a `String`) compiles and simply returns `false`, at most with a compiler warning.

On the other hand, `===` is not a standard Scala operator. It is provided by libraries such as Cats (via the `Eq` type class) to signify type-safe equality: comparing values of incompatible types is a compile-time error rather than a silent `false`. This means you need to bring in an additional library dependency to use `===` in Scala.

Example in Scala:

Here, we’ll compare two strings using `==` and `===` (assuming we have the Cats library for `===`).


import cats.implicits._

val str1 = "Hello"
val str2 = "Hello"
val str3 = "World"

// Using == in Scala
val result1 = (str1 == str2) // true
val result2 = (str1 == str3) // false

// Using === in Scala (requires Cats library)
val result3 = (str1 === str2) // true
val result4 = (str1 === str3) // false

println(s"Using ==: result1 = $result1, result2 = $result2")
println(s"Using ===: result3 = $result3, result4 = $result4")

Using ==: result1 = true, result2 = false
Using ===: result3 = true, result4 = false
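The benefit of `===` is easy to miss in the snippet above, since both operators agree on two strings. The difference shows up with mismatched types: `1 == "1"` compiles and returns `false`, while `1 === "1"` does not compile at all. As a rough analogy only (this helper is hypothetical and not part of Cats or Spark), here is what that type-safety looks like in Python, approximated with a runtime check:

```python
def type_safe_eq(a, b):
    """Hypothetical analogue of Cats' `===`: refuse to compare unrelated types.

    Cats enforces this at compile time via the `Eq` type class; plain Python
    can only approximate the idea with a runtime type check.
    """
    if type(a) is not type(b):
        raise TypeError(f"cannot compare {type(a).__name__} with {type(b).__name__}")
    return a == b

print(type_safe_eq("Hello", "Hello"))  # True
print(type_safe_eq("Hello", "World"))  # False
# type_safe_eq(1, "1") raises TypeError, just as `1 === "1"` fails to compile
```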

In Apache Spark

In the context of Spark, which uses the DataFrame API, `==` and `===` serve different purposes:

  • ==: The standard Scala operator. Applied to two `Column` objects it compares the objects themselves and returns a plain `Boolean`, so it cannot be used to build a column comparison expression.
  • ===: A method on Spark’s `Column` class that builds an equality expression, comparing column values row by row within DataFrames.

In Spark, `===` is defined on the `Column` class and compares the values of two columns (or a column and a literal) for equality, row by row. It returns a new column of boolean values rather than a single `Boolean`.
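This is also why the two language APIs diverge: Python lets a class override `==` itself (via `__eq__`) to return something other than a boolean, whereas Scala’s `==` always returns a `Boolean`, so Spark’s Scala API needed a separate operator. The following toy sketch (a simplified illustration, not Spark’s actual `Column` implementation) shows the PySpark approach:

```python
class Column:
    """Toy stand-in for pyspark.sql.Column: == builds an expression, not a bool."""

    def __init__(self, expr):
        self.expr = expr

    def __eq__(self, other):
        # Return a new expression object instead of evaluating to a boolean
        other_expr = other.expr if isinstance(other, Column) else repr(other)
        return Column(f"({self.expr} = {other_expr})")

    def __repr__(self):
        return f"Column<{self.expr}>"

id_col = Column("ID")
name_col = Column("Name")
print(id_col == name_col)  # Column<(ID = Name)>
print(id_col == 1)         # Column<(ID = 1)>
```

Because `__eq__` returns a `Column`, PySpark can collect the whole comparison as an expression tree and ship it to the query planner, which is exactly what `===` does on the Scala side.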

Example in PySpark:

PySpark has no `===` operator: Python’s `==` is overloaded on the `Column` class to build a column expression instead of returning a boolean. Here is an example comparing a column against a literal:


from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("ExampleApp").getOrCreate()

data = [(1, "Alice"), (2, "Bob"), (1, "Alice")]
df = spark.createDataFrame(data, ["ID", "Name"])

# In PySpark, == on Column objects builds a comparison expression
result_df = df.withColumn("Is_Alice", col("Name") == "Alice")
result_df.show()

+---+-----+--------+
| ID| Name|Is_Alice|
+---+-----+--------+
|  1|Alice|    true|
|  2|  Bob|   false|
|  1|Alice|    true|
+---+-----+--------+

Example in Scala with Spark:


import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("ExampleApp").getOrCreate()
import spark.implicits._

val data = Seq((1, "Alice"), (2, "Bob"), (1, "Alice"))
val df = data.toDF("ID", "Name")

// === builds a column expression comparing values row by row
val resultDF = df.withColumn("Is_Alice", $"Name" === "Alice")
resultDF.show()

// == compares the Column objects themselves and returns a Boolean,
// so the following fails to compile (withColumn expects a Column):
// val resultDFWrong = df.withColumn("Is_Alice", df("Name") == "Alice")

+---+-----+--------+
| ID| Name|Is_Alice|
+---+-----+--------+
|  1|Alice|    true|
|  2|  Bob|   false|
|  1|Alice|    true|
+---+-----+--------+

Both examples show how column comparisons work in Spark DataFrames: in the Scala API you must use `===` rather than `==`, while in PySpark the overloaded `==` plays the same role.
