How to Explode Multiple Columns in a PySpark DataFrame?

In PySpark, the `explode` function transforms a column containing arrays or maps into multiple rows, where each array element or map key-value pair becomes its own row. Spark allows only one generator expression, such as `explode`, per projection, so when you need to explode multiple columns you must apply the `explode` operation to each column in a separate transformation. Below is an example showing how to achieve this in PySpark using a DataFrame.

Exploding Multiple Columns in a PySpark DataFrame

Let’s start by creating a sample PySpark DataFrame that we will use for our example. Assume that our DataFrame has two columns, both containing arrays, and we want to explode both of these columns.

Step 1: Creating the Sample DataFrame


from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

# Create Spark session
spark = SparkSession.builder.appName("ExplodeExample").getOrCreate()

# Sample data
data = [
    (1, ["a", "b", "c"], [1, 2, 3]),
    (2, ["d", "e"], [4, 5]),
    (3, ["f"], [6])
]

# Creating DataFrame
df = spark.createDataFrame(data, ["id", "letters", "numbers"])

df.show(truncate=False)

+---+---------+---------+
|id |letters  |numbers  |
+---+---------+---------+
|1  |[a, b, c]|[1, 2, 3]|
|2  |[d, e]   |[4, 5]   |
|3  |[f]      |[6]      |
+---+---------+---------+

Step 2: Exploding the Columns Sequentially

In this step, we will apply the `explode` function to both columns. Because Spark permits only one generator expression per projection, each column must be exploded in its own transformation, one after the other.


# Exploding the 'letters' column
df_exploded_letters = df.withColumn("letter", explode("letters"))

df_exploded_letters.show(truncate=False)

+---+---------+---------+------+
|id |letters  |numbers  |letter|
+---+---------+---------+------+
|1  |[a, b, c]|[1, 2, 3]|a     |
|1  |[a, b, c]|[1, 2, 3]|b     |
|1  |[a, b, c]|[1, 2, 3]|c     |
|2  |[d, e]   |[4, 5]   |d     |
|2  |[d, e]   |[4, 5]   |e     |
|3  |[f]      |[6]      |f     |
+---+---------+---------+------+

# Exploding the 'numbers' column
df_exploded_both = df_exploded_letters.withColumn("number", explode("numbers"))

df_exploded_both.show(truncate=False)

+---+---------+---------+------+------+
|id |letters  |numbers  |letter|number|
+---+---------+---------+------+------+
|1  |[a, b, c]|[1, 2, 3]|a     |1     |
|1  |[a, b, c]|[1, 2, 3]|a     |2     |
|1  |[a, b, c]|[1, 2, 3]|a     |3     |
|1  |[a, b, c]|[1, 2, 3]|b     |1     |
|1  |[a, b, c]|[1, 2, 3]|b     |2     |
|1  |[a, b, c]|[1, 2, 3]|b     |3     |
|1  |[a, b, c]|[1, 2, 3]|c     |1     |
|1  |[a, b, c]|[1, 2, 3]|c     |2     |
|1  |[a, b, c]|[1, 2, 3]|c     |3     |
|2  |[d, e]   |[4, 5]   |d     |4     |
|2  |[d, e]   |[4, 5]   |d     |5     |
|2  |[d, e]   |[4, 5]   |e     |4     |
|2  |[d, e]   |[4, 5]   |e     |5     |
|3  |[f]      |[6]      |f     |6     |
+---+---------+---------+------+------+

As you can see, the resulting DataFrame `df_exploded_both` is created by exploding both the ‘letters’ and ‘numbers’ columns. Note that each original row expands into the Cartesian product of its two arrays: a row with 3 letters and 3 numbers becomes 3 × 3 = 9 rows, so the final row count is the sum of these products across all input rows.

Conclusion

Exploding multiple columns in a PySpark DataFrame involves sequentially applying the `explode` function to each column. By following the steps above, you can effectively transform array or map columns into multiple rows in your DataFrame.

