How to Determine the Length of an Array Column in Apache Spark?

Apache Spark provides a built-in function, `size`, for determining the number of elements in an array column. In this explanation, I’ll walk you through equivalent examples in PySpark and Scala that show how to compute the length of an array column in a DataFrame.

Using PySpark

First, let’s create a DataFrame with an array column:


from pyspark.sql import SparkSession
from pyspark.sql.functions import size

# Create Spark session
spark = SparkSession.builder.master("local").appName("ArrayLengthExample").getOrCreate()

# Sample Data
data = [
    ("John", [1, 2, 3]),
    ("Alice", [1, 2]),
    ("Bob", [5, 6, 7, 8])
]

# Create DataFrame
columns = ["name", "numbers"]
df = spark.createDataFrame(data, columns)

# Display original DataFrame
df.show()

The output of `df.show()` will be:


+-----+------------+
| name|     numbers|
+-----+------------+
| John|   [1, 2, 3]|
|Alice|      [1, 2]|
|  Bob|[5, 6, 7, 8]|
+-----+------------+

Now, to determine the length of the array column “numbers”, we can use the `size` function:


# Add a new column with the size of the array
df_with_length = df.withColumn("array_length", size(df["numbers"]))

# Display DataFrame with array length column
df_with_length.show()

The output will be:


+-----+------------+------------+
| name|     numbers|array_length|
+-----+------------+------------+
| John|   [1, 2, 3]|           3|
|Alice|      [1, 2]|           2|
|  Bob|[5, 6, 7, 8]|           4|
+-----+------------+------------+
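
Because `size` returns an ordinary integer column, it can also be used directly in expressions such as filters, without materializing a separate length column. Here is a minimal PySpark sketch building on the DataFrame above (the threshold of 2 is an arbitrary choice for illustration):


# Keep only the rows whose "numbers" array has more than two elements
df_filtered = df.filter(size(df["numbers"]) > 2)
df_filtered.show()

With the sample data above, this keeps the rows for John and Bob, whose arrays have three and four elements respectively.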

Using Scala

We can achieve the same result using Scala. Below is an equivalent example:


import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.size

// Create Spark session
val spark = SparkSession.builder.master("local").appName("ArrayLengthExample").getOrCreate()

// Sample Data
val data = Seq(
    ("John", Seq(1, 2, 3)),
    ("Alice", Seq(1, 2)),
    ("Bob", Seq(5, 6, 7, 8))
)

// Create DataFrame
import spark.implicits._
val df = data.toDF("name", "numbers")

// Display original DataFrame
df.show()

The output of `df.show()` will be:


+-----+------------+
| name|     numbers|
+-----+------------+
| John|   [1, 2, 3]|
|Alice|      [1, 2]|
|  Bob|[5, 6, 7, 8]|
+-----+------------+

Now, to determine the length of the array column “numbers”, we can use the `size` function:


// Add a new column with the size of the array
val dfWithLength = df.withColumn("array_length", size($"numbers"))

// Display DataFrame with array length column
dfWithLength.show()

The output will be:


+-----+------------+------------+
| name|     numbers|array_length|
+-----+------------+------------+
| John|   [1, 2, 3]|           3|
|Alice|      [1, 2]|           2|
|  Bob|[5, 6, 7, 8]|           4|
+-----+------------+------------+

In summary, the `size` function works the same way in both PySpark and Scala: combined with `withColumn`, it adds a new column containing the number of elements in each array of the specified array column.
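
One detail worth keeping in mind: with Spark’s default settings, `size` returns -1 when the array value itself is null, rather than 0 or null (the exact behavior is governed by the `spark.sql.legacy.sizeOfNull` and ANSI mode settings, so it may differ in your environment). If you would rather treat a missing array as having length 0, you can guard the call explicitly. Below is a minimal PySpark sketch using hypothetical data with a null array and the `spark` session created earlier:


from pyspark.sql.functions import col, size, when
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

# Explicit schema so the "numbers" column is a nullable array of integers
schema = StructType([
    StructField("name", StringType(), True),
    StructField("numbers", ArrayType(IntegerType()), True),
])
df_nulls = spark.createDataFrame([("John", [1, 2, 3]), ("Dana", None)], schema)

# raw_size shows the default behavior (-1 for a null array);
# safe_size maps a missing array to 0 instead
df_nulls.select(
    "name",
    size(col("numbers")).alias("raw_size"),
    when(col("numbers").isNull(), 0).otherwise(size(col("numbers"))).alias("safe_size"),
).show()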

