To extract column values as a list in Apache Spark, you can use different methods depending on the language and the context you are working in. Below, we’ll explore examples using PySpark, Scala, and Java.
Using PySpark
In PySpark, you can select the column, switch to the underlying RDD, and call flatMap() to unwrap each single-field Row into its value; collect() then returns those values to the driver as a Python list.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
# Sample DataFrame
data = [("Alice", 1), ("Bob", 2), ("Cathy", 3)]
df = spark.createDataFrame(data, ["Name", "Id"])
# Extracting "Name" column values as a list
names_list = df.select("Name").rdd.flatMap(lambda x: x).collect()
print(names_list)
['Alice', 'Bob', 'Cathy']
Using Scala
In Scala, you can use map on the selected column’s RDD to extract the value from each Row, then call collect() and toList to get the values as a List.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.appName("example").getOrCreate()
// Sample DataFrame
val data = Seq(("Alice", 1), ("Bob", 2), ("Cathy", 3))
val df = spark.createDataFrame(data).toDF("Name", "Id")
// Extracting "Name" column values as a list
val namesList = df.select("Name").rdd.map(r => r(0).asInstanceOf[String]).collect().toList
println(namesList)
List(Alice, Bob, Cathy)
Using Java
In Java, the process involves converting the DataFrame column to an RDD and then collecting the values.
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.*;
import java.util.Arrays;
import java.util.List;
public class ExtractColumnValues {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("example").getOrCreate();
        // Sample DataFrame
        List<Row> data = Arrays.asList(
            RowFactory.create("Alice", 1),
            RowFactory.create("Bob", 2),
            RowFactory.create("Cathy", 3));
        StructType schema = new StructType(new StructField[]{
            new StructField("Name", DataTypes.StringType, false, Metadata.empty()),
            new StructField("Id", DataTypes.IntegerType, false, Metadata.empty())
        });
        Dataset<Row> df = spark.createDataFrame(data, schema);
        // Extracting "Name" column values as a list.
        // JavaRDD is not an Iterable, so map over the RDD and collect the results.
        List<String> namesList = df.select("Name")
            .toJavaRDD()
            .map(r -> r.getString(0))
            .collect();
        System.out.println(namesList);
    }
}
[Alice, Bob, Cathy]
In all the above examples, we first create a sample DataFrame and then extract the values of a specific column (“Name”) into a list. In each language, the approach is to convert the selected column into an RDD, unwrap each Row into its value, and collect the values into a list. Because collect() brings every value to the driver, this approach is suitable only when the column fits in the driver’s memory.