Working with PySpark ArrayType Column: Examples

One of the most common complex data types in PySpark is ArrayType. It is useful when you need to work with columns that contain arrays (lists) of elements. In this guide, we will focus on working with ArrayType columns in PySpark, showcasing various operations and functions that can be applied to array columns in a DataFrame.

Understanding ArrayType in PySpark

Before diving into examples, let’s first understand what ArrayType represents in PySpark. An ArrayType column stores multiple items of the same data type, analogous to Python’s list data structure, where you can store a sequence of items. Each value in an ArrayType column is an array, and the element type of that array must be specified when you create the column.
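
For instance, an array-of-strings type is declared by passing the element type to the ArrayType constructor. A minimal sketch (the optional containsNull flag, which defaults to True, controls whether individual elements may be null):


from pyspark.sql.types import ArrayType, StringType

# An array of strings; containsNull controls whether elements may be null
languages_type = ArrayType(StringType(), containsNull=True)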

Creating a DataFrame with ArrayType Column

To create a DataFrame with an ArrayType column, you can use the PySpark SQL types module to define the schema. Here’s an example:


from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StringType, StructType, StructField

# Initialize Spark session
spark = SparkSession.builder.appName("ArrayTypeExample").getOrCreate()

# Define a schema with ArrayType
schema = StructType([
    StructField("name", StringType(), True),
    StructField("languages", ArrayType(StringType()), True)
])

# Create a DataFrame with the schema
data = [("Alice", ["English", "Spanish"]),
        ("Bob", ["German", "French"]),
        ("Charlie", ["Tamil", "Telugu"])]

df = spark.createDataFrame(data, schema)

# Show the DataFrame
df.show()

Running the above code will produce the following output:


+-------+-------------------+
|   name|          languages|
+-------+-------------------+
|  Alice| [English, Spanish]|
|    Bob|   [German, French]|
|Charlie|    [Tamil, Telugu]|
+-------+-------------------+
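
You can confirm the column type with printSchema(), which for the schema above should report the array and its element type roughly as follows:


# Inspect the schema: languages is an array whose elements are strings
df.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- languages: array (nullable = true)
#  |    |-- element: string (containsNull = true)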

Performing Operations on ArrayType Columns

Now that we have a DataFrame with an ArrayType column, we can perform various operations on it, such as accessing elements, adding elements, exploding arrays, and filtering, using the array functions available in PySpark’s functions module (pyspark.sql.functions).

Accessing Elements in an Array

You can access elements in an array with the subscript operator, which uses 0-based indexing:


from pyspark.sql.functions import col

# Select the first language of each person
df.withColumn("first_language", col("languages")[0]).show()

The output will display the first language element for each row:


+-------+-------------------+--------------+
|   name|          languages|first_language|
+-------+-------------------+--------------+
|  Alice| [English, Spanish]|       English|
|    Bob|   [German, French]|        German|
|Charlie|    [Tamil, Telugu]|         Tamil|
+-------+-------------------+--------------+
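
Two equivalent alternatives are worth knowing: getItem mirrors the subscript operator (0-based), while element_at uses 1-based indexing and accepts negative indices that count from the end of the array. A quick sketch:


from pyspark.sql.functions import col, element_at

# getItem(0) is equivalent to the subscript operator col("languages")[0]
df.withColumn("first_language", col("languages").getItem(0)).show()

# element_at is 1-based; -1 selects the last element
df.withColumn("last_language", element_at(col("languages"), -1)).show()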

Adding Elements to an Array

To add elements to an array, you can use the array_union function, which returns the union of two arrays. Wrap the new element in a single-element array with array and lit:


from pyspark.sql.functions import array_union, array, lit, col

# Add a new language to the languages column for each person
updated_df = df.withColumn("languages", array_union(col("languages"), array(lit("Chinese"))))
updated_df.show(truncate=False)

The output adds “Chinese” to the languages array of each person:


+-------+---------------------------+
|name   |languages                  |
+-------+---------------------------+
|Alice  |[English, Spanish, Chinese]|
|Bob    |[German, French, Chinese]  |
|Charlie|[Tamil, Telugu, Chinese]   |
+-------+---------------------------+
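
Note that array_union is a set-style operation, so duplicates are removed. As a quick sketch, adding “English” only changes the rows that do not already contain it; Alice’s array stays the same:


from pyspark.sql.functions import array_union, array, lit, col

# "English" is only appended where it is not already present
df.withColumn("languages", array_union(col("languages"), array(lit("English")))).show(truncate=False)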

Exploding an Array

Sometimes, you may want to “explode” an array into a new row for each element. This can be done using the explode function:


from pyspark.sql.functions import explode

# Explode the languages array into multiple rows
exploded_df = df.withColumn("language", explode(col("languages")))
exploded_df.show()

The output shows each language on a separate row:


+-------+-------------------+--------+
|   name|          languages|language|
+-------+-------------------+--------+
|  Alice| [English, Spanish]| English|
|  Alice| [English, Spanish]| Spanish|
|    Bob|   [German, French]|  German|
|    Bob|   [German, French]|  French|
|Charlie|    [Tamil, Telugu]|   Tamil|
|Charlie|    [Tamil, Telugu]|  Telugu|
+-------+-------------------+--------+
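
Two related functions can be handy here: posexplode also returns each element’s position within the array, and explode_outer keeps rows whose array is null or empty (emitting a null instead of dropping the row). A brief sketch of both:


from pyspark.sql.functions import posexplode, explode_outer, col

# posexplode yields a position column alongside the value
df.select("name", posexplode(col("languages")).alias("pos", "language")).show()

# explode_outer preserves rows with null or empty arrays
df.select("name", explode_outer(col("languages")).alias("language")).show()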

Filtering Based on Array Elements

You can also filter rows based on conditions that involve elements of the ArrayType column. Here’s an example of filtering rows where the “languages” array contains “English”:


from pyspark.sql.functions import array_contains

# Filter rows that contain "English" in the languages column
english_speakers_df = df.filter(array_contains(col("languages"), "English"))
english_speakers_df.show()

Only rows where “languages” contains “English” are displayed:


+-----+-------------------+
| name|          languages|
+-----+-------------------+
|Alice| [English, Spanish]|
+-----+-------------------+
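
You can also filter on properties of the array itself, such as its length. For example, here is a sketch that uses the size function to keep only people who speak at least two languages:


from pyspark.sql.functions import size, col

# Keep rows whose languages array has at least two elements
df.filter(size(col("languages")) >= 2).show()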

Working with Complex Operations on ArrayType Columns

PySpark offers an extensive list of functions to work with ArrayType columns for more complex operations such as sorting, concatenating arrays, and collecting lists.

Sorting Arrays in a DataFrame

To sort arrays, you can use the array_sort function:


from pyspark.sql.functions import array_sort

# Sort the languages array for each row
sorted_df = df.withColumn("languages_sorted", array_sort(col("languages")))
sorted_df.show()

The languages arrays are sorted alphabetically:


+-------+-------------------+-------------------+
|   name|          languages|   languages_sorted|
+-------+-------------------+-------------------+
|  Alice| [English, Spanish]| [English, Spanish]|
|    Bob|   [German, French]|   [French, German]|
|Charlie|    [Tamil, Telugu]|    [Tamil, Telugu]|
+-------+-------------------+-------------------+
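
array_sort always sorts in ascending order with nulls placed last. If you need a descending sort, the older sort_array function takes an asc flag, as in this sketch:


from pyspark.sql.functions import sort_array, col

# Sort each languages array in descending order
df.withColumn("languages_desc", sort_array(col("languages"), asc=False)).show()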

Concatenating Multiple Arrays

To concatenate multiple arrays into a single flat array, use the concat function:


from pyspark.sql.functions import concat, array, lit, col

# Concatenate the languages array with a one-element array
df.withColumn("concatenated", concat(col("languages"), array(lit("Japanese")))).show(truncate=False)

The arrays are concatenated and the output will display the resulting array:


+-------+------------------+----------------------------+
|name   |languages         |concatenated                |
+-------+------------------+----------------------------+
|Alice  |[English, Spanish]|[English, Spanish, Japanese]|
|Bob    |[German, French]  |[German, French, Japanese]  |
|Charlie|[Tamil, Telugu]   |[Tamil, Telugu, Japanese]   |
+-------+------------------+----------------------------+
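
Finally, collect_list, mentioned at the start of this section, aggregates values back into an array and is effectively the inverse of explode (note that it does not guarantee element order). Here is a sketch that rebuilds each person’s array from the exploded_df created earlier:


from pyspark.sql.functions import collect_list

# Group the exploded rows back into one array per name
exploded_df.groupBy("name").agg(collect_list("language").alias("languages")).show()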

Conclusion

In this comprehensive guide, we explored working with PySpark ArrayType columns in various ways, ranging from basic operations such as accessing elements and exploding arrays to more advanced tasks like filtering, sorting, and transforming arrays using built-in functions. PySpark’s seamless integration with complex data types like ArrayType enables developers to perform sophisticated data transformations and manipulations with ease. Whether you are preprocessing data for machine learning models, performing data wrangling tasks, or managing complex ETL pipelines, understanding how to effectively work with PySpark’s ArrayType columns is crucial for dealing with real-world data scenarios.
