When working with Apache Spark using PySpark, it’s quite common to encounter scenarios where you need to convert a string type column into an array column. String columns that represent lists or collections of items can be split into arrays to facilitate the array-based operations provided by Spark SQL. In this guide, we will go through the process of converting a string to an array column in PySpark using various methods and functions.
Understanding PySpark DataFrames
Before proceeding with the transformation, it’s important to understand the basics of PySpark DataFrames. A DataFrame in PySpark is a distributed collection of data organized into named columns, and it is conceptually equivalent to a table in a relational database. DataFrames allow users to perform various operations like selection, filtering, and aggregation efficiently across large datasets.
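As a quick illustration (a minimal sketch using made-up rows, not data from the examples below), here is how a small DataFrame is created and queried:
from pyspark.sql import SparkSession
# Initialize a Spark session
spark = SparkSession.builder.appName('DataFrameBasics').getOrCreate()
# A tiny DataFrame with two named columns
people = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
# Selection and filtering work much like SQL
people.select("name").show()
people.filter(people["age"] > 40).show()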
Using the split Function
One of the simplest methods to convert a string to an array is by using the `split` function available in `pyspark.sql.functions`. This function splits the string around a specified delimiter and returns an array of substrings. Let’s look at an example of how to use this function.
Example: Split Comma-Separated String into Array
Suppose we have a DataFrame with a column 'names' containing comma-separated strings, and we want to split these by the comma to get an array.
from pyspark.sql import SparkSession
from pyspark.sql.functions import split
# Initialize Spark session
spark = SparkSession.builder.appName('StringToArray').getOrCreate()
# Data
data = [("James,John,Jenny,Jeniffer",),
("Michael,Robert,William,Willaimina",),
("Smith,Sanders,Peter,Parker",)]
# Create DataFrame
df = spark.createDataFrame(data, ["names"])
# Show the original DataFrame
df.show(truncate=False)
# Define the delimiter
delimiter = ","
# Split the 'names' column
df = df.withColumn("names_array", split(df["names"], delimiter))
# Show the transformed DataFrame
df.show(truncate=False)
# (The Spark session is reused by the later examples, so it is not stopped here)
The output of the above code snippet will look like this:
+---------------------------------+
|names                            |
+---------------------------------+
|James,John,Jenny,Jeniffer        |
|Michael,Robert,William,Willaimina|
|Smith,Sanders,Peter,Parker       |
+---------------------------------+
+---------------------------------+--------------------------------------+
|names                            |names_array                           |
+---------------------------------+--------------------------------------+
|James,John,Jenny,Jeniffer        |[James, John, Jenny, Jeniffer]        |
|Michael,Robert,William,Willaimina|[Michael, Robert, William, Willaimina]|
|Smith,Sanders,Peter,Parker       |[Smith, Sanders, Peter, Parker]       |
+---------------------------------+--------------------------------------+
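Once the column is an array, Spark's built-in array functions can operate on it directly. As a brief illustration (continuing with the DataFrame above), you can check each array's length or test it for membership:
from pyspark.sql.functions import size, array_contains
# Array length and membership test on the new array column
df.select("names",
          size("names_array").alias("num_names"),
          array_contains("names_array", "John").alias("has_john")).show(truncate=False)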
Handling Different Delimiters
The `split` function can work with different delimiters. For instance, if a column contains strings separated by semicolons instead of commas, you would simply change the delimiter in the function.
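Note that in Spark 3.0 and later, `split` also accepts an optional third argument, `limit`, which caps the number of resulting elements; whatever follows the last allowed split stays intact in the final element. A quick sketch:
# With limit=2, only the first comma is used as a split point (Spark 3.0+)
df_limited = df.withColumn("first_and_rest", split(df["names"], ",", 2))
df_limited.show(truncate=False)
# "James,John,Jenny,Jeniffer" becomes [James, John,Jenny,Jeniffer]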
Example: Split Semicolon-Separated String into Array
Here’s a similar example that uses a semicolon as the delimiter.
# Assuming 'df' is a DataFrame with semicolon-separated strings in the 'names' column
delimiter = ";"
# We can just change the delimiter to semicolon to split the strings
df = df.withColumn("names_array", split(df["names"], delimiter))
df.show(truncate=False)
# Output will vary based on the actual data
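For a concrete (made-up) illustration, suppose the column actually held semicolon-separated values:
# Hypothetical semicolon-separated data
data_sc = [("James;John;Jenny",), ("Smith;Sanders;Peter",)]
df_sc = spark.createDataFrame(data_sc, ["names"])
df_sc = df_sc.withColumn("names_array", split(df_sc["names"], ";"))
df_sc.show(truncate=False)
# names               -> names_array
# James;John;Jenny    -> [James, John, Jenny]
# Smith;Sanders;Peter -> [Smith, Sanders, Peter]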
Using the explode Function
Another function that often comes up in this context is `explode`. While the `explode` function is primarily used to convert an array into rows, i.e., one row for each element in the array, it can be combined with `split` when you want to break a string apart and work with the individual pieces as rows.
Example: Exploding a Split String into Rows and Re-aggregating
Assume we want to split a comma-separated string and then explode it into separate rows.
from pyspark.sql.functions import explode
# Starting with the same DataFrame as before
df = spark.createDataFrame(data, ["names"])
# Split and explode the 'names' column
df_exploded = df.withColumn("name", explode(split(df["names"], ",")))
# Show the result
df_exploded.show(truncate=False)
# Output:
# +---------------------------------+----------+
# |names                            |name      |
# +---------------------------------+----------+
# |James,John,Jenny,Jeniffer        |James     |
# |James,John,Jenny,Jeniffer        |John      |
# |James,John,Jenny,Jeniffer        |Jenny     |
# |James,John,Jenny,Jeniffer        |Jeniffer  |
# ...
The result shows each name as a separate row. Exploding is not what you need when the goal is simply an array column, but it is handy when you want to work with the individual elements, and you can re-aggregate the rows back into an array afterwards, as shown below.
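To re-aggregate the exploded rows back into an array, you can group on the original column and use `collect_list`. Here is a short sketch, using the hypothetical step of dropping a single name before rebuilding the array:
from pyspark.sql.functions import collect_list
# Drop one element, then gather the remaining names back into an array per row
# (note: collect_list does not guarantee element order)
df_filtered = df_exploded.filter(df_exploded["name"] != "John")
df_rebuilt = df_filtered.groupBy("names").agg(collect_list("name").alias("names_array"))
df_rebuilt.show(truncate=False)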
Regular Expression Based Splitting
In some cases, you may have a more complex string pattern that requires regular expressions to split accurately. PySpark’s `split` function also supports regular expressions.
Example: Regex-Based Splitting
Suppose each value consists of digit-letter pairs such as '1a 2b 3c', and we want to split wherever one or more consecutive digits appear.
from pyspark.sql.functions import split, col
# Example DataFrame with complex string pattern
data = [("1a 2b 3c","James"), ("4d 5e 6f","Michael")]
df_complex = spark.createDataFrame(data, ["pattern", "name"])
# Split the 'pattern' column using regular expression
df_complex = df_complex.withColumn("pattern_array", split(col("pattern"), "\\d+"))
df_complex.show(truncate=False)
# The output will be:
# +--------+-------+-------------+
# |pattern |name   |pattern_array|
# +--------+-------+-------------+
# |1a 2b 3c|James  |[, a , b , c]|
# |4d 5e 6f|Michael|[, d , e , f]|
# +--------+-------+-------------+
The resulting DataFrame ‘df_complex’ will have a new column ‘pattern_array’ where the strings are split based on the regular expression provided.
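Note that `split` keeps whatever falls between matches, which is why the arrays above contain an empty first element and trailing spaces. If you want cleaner elements, one option (sketched below) is to make the pattern consume the surrounding whitespace as well and then drop the empty strings with `array_remove`:
from pyspark.sql.functions import array_remove
# Consume the digits plus any surrounding whitespace, then remove empty strings
df_clean = df_complex.withColumn(
    "pattern_array",
    array_remove(split(col("pattern"), "\\s*\\d+\\s*"), ""))
df_clean.show(truncate=False)
# pattern_array is now [a, b, c] and [d, e, f]
# Stop the Spark session when you are done
spark.stop()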
Conclusion
Converting string columns to array columns in PySpark is a versatile operation that can be achieved using functions such as `split` and `explode`. Whether your strings are separated by a specific delimiter or you require a regular expression for more complex patterns, PySpark provides the necessary flexibility. Understanding how to perform such transformations allows you to fully leverage Spark’s powerful data manipulation capabilities.