When working with Apache Spark using PySpark, it’s quite common to encounter scenarios where you need to convert a string type column into an array column. String columns that represent lists or collections of items can be split into arrays to facilitate the array-based operations provided by Spark SQL. In this guide, we will go through the process of converting a string to an array column in PySpark using various methods and functions.
Understanding PySpark DataFrames
Before proceeding with the transformation, it’s important to understand the basics of PySpark DataFrames. A DataFrame in PySpark is a distributed collection of data organized into named columns, and it is conceptually equivalent to a table in a relational database. DataFrames allow users to perform various operations like selection, filtering, and aggregation efficiently across large datasets.
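As a quick illustration (a minimal sketch using made-up rows, not data from the examples below), here is how a small DataFrame is created and queried:
from pyspark.sql import SparkSession
# Initialize a Spark session
spark = SparkSession.builder.appName('DataFrameBasics').getOrCreate()
# A tiny DataFrame with two named columns
people = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
# Selection and filtering work much like SQL
people.select("name").show()
people.filter(people["age"] > 40).show()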
Using the split Function
One of the simplest methods to convert a string to an array is by using the `split` function available in `pyspark.sql.functions`. This function splits the string around a specified delimiter and returns an array of substrings. Let’s look at an example of how to use this function.
Example: Split Comma-Separated String into Array
Suppose we have a DataFrame with a column 'names' containing comma-separated strings, and we want to split these by the comma to get an array.
from pyspark.sql import SparkSession
from pyspark.sql.functions import split
# Initialize Spark session
spark = SparkSession.builder.appName('StringToArray').getOrCreate()
# Data
data = [("James,John,Jenny,Jeniffer",),
("Michael,Robert,William,Willaimina",),
("Smith,Sanders,Peter,Parker",)]
# Create DataFrame
df = spark.createDataFrame(data, ["names"])
# Show the original DataFrame
df.show(truncate=False)
# Define the delimiter
delimiter = ","
# Split the 'names' column
df = df.withColumn("names_array", split(df["names"], delimiter))
# Show the transformed DataFrame
df.show(truncate=False)
# (The Spark session is reused by the later examples, so it is not stopped here)
The output of the above code snippet will look like this:
+---------------------------------+
|names                            |
+---------------------------------+
|James,John,Jenny,Jeniffer        |
|Michael,Robert,William,Willaimina|
|Smith,Sanders,Peter,Parker       |
+---------------------------------+
+---------------------------------+--------------------------------------+
|names                            |names_array                           |
+---------------------------------+--------------------------------------+
|James,John,Jenny,Jeniffer        |[James, John, Jenny, Jeniffer]        |
|Michael,Robert,William,Willaimina|[Michael, Robert, William, Willaimina]|
|Smith,Sanders,Peter,Parker       |[Smith, Sanders, Peter, Parker]       |
+---------------------------------+--------------------------------------+
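Once the column is an array, Spark's built-in array functions can operate on it directly. As a brief illustration (continuing with the DataFrame above), you can check each array's length or test it for membership:
from pyspark.sql.functions import size, array_contains
# Array length and membership test on the new array column
df.select("names",
          size("names_array").alias("num_names"),
          array_contains("names_array", "John").alias("has_john")).show(truncate=False)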
Handling Different Delimiters
The `split` function can work with different delimiters. For instance, if a column contains strings separated by semicolons instead of commas, you would simply change the delimiter in the function.
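Note that in Spark 3.0 and later, `split` also accepts an optional third argument, `limit`, which caps the number of resulting elements; whatever follows the last allowed split stays intact in the final element. A quick sketch:
# With limit=2, only the first comma is used as a split point (Spark 3.0+)
df_limited = df.withColumn("first_and_rest", split(df["names"], ",", 2))
df_limited.show(truncate=False)
# "James,John,Jenny,Jeniffer" becomes [James, John,Jenny,Jeniffer]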
Example: Split Semicolon-Separated String into Array
Here’s a similar example that uses a semicolon as the delimiter.
# Assuming 'df' is a DataFrame with semicolon-separated strings in the 'names' column
delimiter = ";"
# We can just change the delimiter to semicolon to split the strings
df = df.withColumn("names_array", split(df["names"], delimiter))
df.show(truncate=False)
# Output will vary based on the actual data
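For a concrete (made-up) illustration, suppose the column actually held semicolon-separated values:
# Hypothetical semicolon-separated data
data_sc = [("James;John;Jenny",), ("Smith;Sanders;Peter",)]
df_sc = spark.createDataFrame(data_sc, ["names"])
df_sc = df_sc.withColumn("names_array", split(df_sc["names"], ";"))
df_sc.show(truncate=False)
# names               -> names_array
# James;John;Jenny    -> [James, John, Jenny]
# Smith;Sanders;Peter -> [Smith, Sanders, Peter]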
Using the explode Function
Another function that often comes up in this context is `explode`. While the `explode` function is primarily used to convert an array into rows, i.e., one row for each element in the array, it can be combined with `split` when you want to break a string apart and work with the individual pieces as rows.
Example: Exploding a Split String into Rows and Re-aggregating
Assume we want to split a comma-separated string and then explode it into separate rows.
from pyspark.sql.functions import explode
# Starting with the same DataFrame as before
df = spark.createDataFrame(data, ["names"])
# Split and explode the 'names' column
df_exploded = df.withColumn("name", explode(split(df["names"], ",")))
# Show the result
df_exploded.show(truncate=False)
# Output:
# +---------------------------------+----------+
# |names                            |name      |
# +---------------------------------+----------+
# |James,John,Jenny,Jeniffer        |James     |
# |James,John,Jenny,Jeniffer        |John      |
# |James,John,Jenny,Jeniffer        |Jenny     |
# |James,John,Jenny,Jeniffer        |Jeniffer  |
# ...
The result shows each name as a separate row. Exploding is not what you need when the goal is simply an array column, but it is handy when you want to work with the individual elements, and you can re-aggregate the rows back into an array afterwards, as shown below.
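To re-aggregate the exploded rows back into an array, you can group on the original column and use `collect_list`. Here is a short sketch, using the hypothetical step of dropping a single name before rebuilding the array:
from pyspark.sql.functions import collect_list
# Drop one element, then gather the remaining names back into an array per row
# (note: collect_list does not guarantee element order)
df_filtered = df_exploded.filter(df_exploded["name"] != "John")
df_rebuilt = df_filtered.groupBy("names").agg(collect_list("name").alias("names_array"))
df_rebuilt.show(truncate=False)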
Regular Expression Based Splitting
In some cases, you may have a more complex string pattern that requires regular expressions to split accurately. PySpark’s `split` function also supports regular expressions.
Example: Regex-Based Splitting
Suppose each value consists of digit-letter pairs such as '1a 2b 3c', and we want to split wherever one or more consecutive digits appear.
from pyspark.sql.functions import split, col
# Example DataFrame with complex string pattern
data = [("1a 2b 3c","James"), ("4d 5e 6f","Michael")]
df_complex = spark.createDataFrame(data, ["pattern", "name"])
# Split the 'pattern' column using regular expression
df_complex = df_complex.withColumn("pattern_array", split(col("pattern"), "\\d+"))
df_complex.show(truncate=False)
# The output will be:
# +--------+-------+-------------+
# |pattern |name   |pattern_array|
# +--------+-------+-------------+
# |1a 2b 3c|James  |[, a , b , c]|
# |4d 5e 6f|Michael|[, d , e , f]|
# +--------+-------+-------------+
The resulting DataFrame ‘df_complex’ will have a new column ‘pattern_array’ where the strings are split based on the regular expression provided.
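Note that `split` keeps whatever falls between matches, which is why the arrays above contain an empty first element and trailing spaces. If you want cleaner elements, one option (sketched below) is to make the pattern consume the surrounding whitespace as well and then drop the empty strings with `array_remove`:
from pyspark.sql.functions import array_remove
# Consume the digits plus any surrounding whitespace, then remove empty strings
df_clean = df_complex.withColumn(
    "pattern_array",
    array_remove(split(col("pattern"), "\\s*\\d+\\s*"), ""))
df_clean.show(truncate=False)
# pattern_array is now [a, b, c] and [d, e, f]
# Stop the Spark session when you are done
spark.stop()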
Conclusion
Converting string columns to array columns in PySpark is a versatile operation that can be achieved using functions such as `split` and `explode`. Whether your strings are separated by a specific delimiter or you require a regular expression for more complex patterns, PySpark provides the necessary flexibility. Understanding how to perform such transformations allows you to fully leverage Spark’s powerful data manipulation capabilities.