How to Use Column.isin with a List in Apache Spark?

When working with Apache Spark, the `Column.isin` method is commonly used to filter DataFrame rows based on whether a column's value exists within a list of specified values. Below, we explore how to use this method in PySpark, Scala, and Java.

Using Column.isin with a List in PySpark

In PySpark, you can use the `isin` method on a DataFrame column to filter rows where the column’s value is part of a given list. Here’s an example:


from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()

# Sample data
data = [("Alice", 34), ("Bob", 45), ("Cathy", 29)]
columns = ["Name", "Age"]

# Create DataFrame
df = spark.createDataFrame(data, schema=columns)

# List of names to filter
name_list = ["Bob", "Cathy"]

# Filter DataFrame using isin
filtered_df = df.filter(col("Name").isin(name_list))

# Show results
filtered_df.show()

+-----+---+
| Name|Age|
+-----+---+
|  Bob| 45|
|Cathy| 29|
+-----+---+

Using Column.isin with a List in Scala

In Scala, the approach is similar: call the `isin` method on a column, expanding the list into varargs with `: _*`. Here’s an example:


import org.apache.spark.sql.SparkSession

// Initialize SparkSession
val spark = SparkSession.builder.appName("example").getOrCreate()

// Sample data
val data = Seq(("Alice", 34), ("Bob", 45), ("Cathy", 29))
val columns = Seq("Name", "Age")

// Create DataFrame
import spark.implicits._
val df = data.toDF(columns: _*)

// List of names to filter
val nameList = List("Bob", "Cathy")

// Filter DataFrame using isin
val filteredDf = df.filter($"Name".isin(nameList: _*))

// Show results
filteredDf.show()

+-----+---+
| Name|Age|
+-----+---+
|  Bob| 45|
|Cathy| 29|
+-----+---+

Using Column.isin with a List in Java

In Java, you also use the `isin` method, but the syntax differs slightly: the list must be converted to an array before being passed, and the sample data must be built as `Row` objects. Here’s an example:


import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.col;

// Initialize SparkSession
SparkSession spark = SparkSession.builder().appName("example").getOrCreate();

// Sample data as Rows (createDataFrame with a schema expects List<Row>)
List<Row> data = Arrays.asList(
    RowFactory.create("Alice", 34),
    RowFactory.create("Bob", 45),
    RowFactory.create("Cathy", 29)
);

// Define schema
StructType schema = new StructType()
    .add("Name", "string")
    .add("Age", "integer");

// Create DataFrame
Dataset<Row> df = spark.createDataFrame(data, schema);

// List of names to filter
List<String> nameList = Arrays.asList("Bob", "Cathy");

// Filter DataFrame using isin
Dataset<Row> filteredDf = df.filter(col("Name").isin(nameList.toArray()));

// Show results
filteredDf.show();

+-----+---+
| Name|Age|
+-----+---+
|  Bob| 45|
|Cathy| 29|
+-----+---+

Conclusion

The `Column.isin` method is a powerful and easy-to-use feature in Apache Spark for filtering rows based on a list of values. Whether using PySpark, Scala, or Java, the method provides a straightforward way to accomplish this task. By following the examples provided, you can efficiently apply the `isin` method across different languages and use cases.
