How to Extract Column Values as a List in Apache Spark?

To extract column values as a list in Apache Spark, you can use different methods depending on the language and the context you are working in. Below, we’ll explore examples using PySpark, Scala, and Java.

Using PySpark

In PySpark, a common approach is to select the column, convert it to an RDD, and flatten the rows with flatMap() before calling collect(). You can also collect() the rows directly and unpack them with a list comprehension, as shown after the output below.


from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

# Sample DataFrame
data = [("Alice", 1), ("Bob", 2), ("Cathy", 3)]
df = spark.createDataFrame(data, ["Name", "Id"])

# Extracting "Name" column values as a list
names_list = df.select("Name").rdd.flatMap(lambda x: x).collect()

print(names_list)

Output:

['Alice', 'Bob', 'Cathy']
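If you prefer to avoid the RDD API, you can collect() the rows directly and pull the field out of each Row with a list comprehension. This is a minimal equivalent sketch using the same df as above:

# Collect the rows on the driver, then extract the "Name" field from each Row
names_list = [row["Name"] for row in df.select("Name").collect()]

print(names_list)  # ['Alice', 'Bob', 'Cathy']

Both versions return the same list; the difference is only where the Row objects are unpacked (on the executors with flatMap, on the driver with the comprehension).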

Using Scala

In Scala, you can convert the selected column to an RDD, map each Row to its value, and then collect() the results into a List.


import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("example").getOrCreate()

// Sample DataFrame
val data = Seq(("Alice", 1), ("Bob", 2), ("Cathy", 3))
val df = spark.createDataFrame(data).toDF("Name", "Id")

// Extracting "Name" column values as a list
val namesList = df.select("Name").rdd.map(r => r.getString(0)).collect().toList

println(namesList)

Output:

List(Alice, Bob, Cathy)

Using Java

In Java, the simplest approach is to collect the rows with collectAsList() and then map each Row to its String value using the Java Stream API.


import org.apache.spark.sql.*;
import org.apache.spark.sql.types.*;

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ExtractColumnValues {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("example").getOrCreate();

        // Sample DataFrame
        List<Row> data = Arrays.asList(
                RowFactory.create("Alice", 1),
                RowFactory.create("Bob", 2),
                RowFactory.create("Cathy", 3));
        StructType schema = new StructType(new StructField[]{
                new StructField("Name", DataTypes.StringType, false, Metadata.empty()),
                new StructField("Id", DataTypes.IntegerType, false, Metadata.empty())
        });
        Dataset<Row> df = spark.createDataFrame(data, schema);

        // Extracting "Name" column values as a list
        // Collect the rows on the driver, then map each Row to its String value
        List<String> namesList = df.select("Name").collectAsList().stream()
                .map(r -> r.getString(0))
                .collect(Collectors.toList());

        System.out.println(namesList);
    }
}

Output:

[Alice, Bob, Cathy]

In all of the examples above, we first create a sample DataFrame and then extract the values of a specific column ("Name") into a list, either by converting the selected column to an RDD or by collecting the rows on the driver and mapping each Row to its value. Keep in mind that collect() (and collectAsList()) brings every value to the driver, so these approaches are appropriate only when the column fits comfortably in driver memory.
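If the column may be too large to collect at once, one option in PySpark is toLocalIterator(), which streams rows to the driver one partition at a time. A minimal sketch, assuming the same df as in the PySpark example above:

# Iterate over the column without materializing all of it on the driver at once
for row in df.select("Name").toLocalIterator():
    print(row["Name"])

This keeps driver memory bounded, at the cost of fetching partitions sequentially, which is usually slower than a single collect().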
